Big Data in Web Search. Claudio Lucchese hpc.isti.cnr.it

Transcription

1 Big Data in Web Search Claudio Lucchese hpc.isti.cnr.it

2 High Performance Computing Lab
- 7 Researchers
- 3 Post-doc fellows
- 6 PhD students
- 3 Research Associates from U. of Venice and Pisa

3 Main Research Topics
Web Search & Scalable DM/ML
- Responsiveness of large-scale search systems; storage, analysis and indexing of large amounts of data
- Machine learning and Web mining techniques for Ranking, Prediction, Recommendation, Diversification, Social media analysis, Entity Linking and Semantic Enrichment
Cloud and Distributed computing
- Cloud federations, Resource Management
- Network overlays for P2P and Big Data
- Scalable data analysis with Hadoop Map-Reduce, Giraph, Spark, etc.

4 Outline
- Some recent Learning to Rank activities
- User Task Discovery in Query Logs
  Lucchese, C., Orlando, S., Perego, R., Silvestri, F., & Tolomei, G. Discovering tasks from search engine query logs. ACM Transactions on Information Systems (TOIS), 31(3), ACM Notable Article.
- Entity Linking
  D. Ceccarelli, C. Lucchese, S. Orlando, R. Perego, S. Trani. Learning relatedness measures for entity linking. In Proceedings of CIKM '13: ACM Int. Conference on Information and Knowledge Management, p , Oct
- News Recommendation
  G. De Francisci Morales, A. Gionis, C. Lucchese. From Chatter to Headlines: Harnessing the Real-Time Web for Personalized News Recommendation. In Proceedings of WSDM '12: ACM Int. Conference on Web Search and Data Mining, Seattle, Washington, USA, February
- Tour planning
  Brilhante, I., Macedo, J. A., Nardini, F. M., Perego, R., & Renso, C. Where shall we go today?: planning touristic tours with tripbuilder. In Proceedings of the CIKM '13: ACM Int. Conference on Information and Knowledge Management, pp Oct

5 Ranking
Ranking is (one of) the most important challenges in Web Search.
We define Ranking as the problem of sorting a set of documents according to their relevance to the user query.
This is a typical Big Data task: user feedback is very relevant.

6 Learning to Rank is:
Training set: (q_1, d_11, r_11), (q_1, d_12, r_12), ..., (q_1, d_1k, r_1k), ..., (q_m, d_m1, r_m1), ..., (q_m, d_mk, r_mk)
Learning: training set -> scoring function h
Document scoring: (q*, d*, ?) -> h -> (q*, d*, h(q*, d*))
The goal is to learn the ranking, not the label!

7 Is it easy?
Not so easy when optimizing typical Information Retrieval measures.
One simple reason is that they imply sorting (of documents), which is not a differentiable function of the scores.
Therefore we cannot directly apply gradient descent or similar methods.

8 QuickScore
A Learning-to-Rank function is typically implemented as a forest of thousands of decision trees.
QuickScore is a cache-aware algorithm improving the scoring efficiency of tree-based ranking models.
- It is 2x to 6.5x faster than state-of-the-art implementations
- It visits a tree by touching a smaller number of nodes
We also implemented an efficient multi-threaded learning toolkit, named QuickLearn, implementing a few variants of Gradient Boosted Regression Trees.

9 Big Data in Web Search
Europeana is the biggest European Cultural Heritage portal.
Within the ASSETS EU CIP Project we designed a new ranking function, which was actually deployed on the portal!
User feedback (result clicks) is exploited to learn document relevance.

10 Big Data in Web Search
- Ranking design: define a ranking architecture, learn a ranking function, feature tuning, efficient scoring, etc.
- Data cleaning: remove near-duplicate documents from a collection of 6 billion documents with ~1000 CPU cores.
- NoSQL storage, massive Hadoop MapReduce computations

12 User Task Discovery (UTD)
Users have interleaving multi-tasking behavior.
Thanks to User Task Discovery, it would be possible to better understand users, to recommend tasks, etc.
Claudio Lucchese 2nd HPC Workshop - Playing with Learning to Rank

13 User Task Discovery (UTD)
Easy process:
- Put weights on the similarity graph
- Remove low weighted edges
- Extract Connected Components
How to find a good query similarity function?
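The three steps of this process can be sketched as follows. This is only a minimal illustration: the function names, the word-level Jaccard similarity, the threshold value and the toy session are all assumptions made for the example, not the similarity function studied in the talk.

```python
from collections import defaultdict

def discover_tasks(queries, similarity, threshold):
    """Threshold the query similarity graph, then return its connected components."""
    # Steps 1 and 2: weight every pair and keep only edges reaching the threshold.
    neighbors = defaultdict(set)
    for i in range(len(queries)):
        for j in range(i + 1, len(queries)):
            if similarity(queries[i], queries[j]) >= threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)
    # Step 3: extract connected components with an iterative depth-first search.
    seen, tasks = set(), []
    for start in range(len(queries)):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(queries[node])
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        tasks.append(sorted(component))
    return tasks

def word_jaccard(a, b):
    """Hypothetical similarity: Jaccard coefficient between the word sets of two queries."""
    wa, wb = set(a.split()), set(b.split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

# Toy session (invented for illustration).
session = ["cheap flights rome", "flights rome", "python sort list",
           "rome hotels", "sort list python"]
tasks = discover_tasks(session, word_jaccard, threshold=0.2)
for task in tasks:
    print(task)
```

Each printed list is one candidate task. The quality of the result depends entirely on the similarity function, which is the question the slide raises.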

14 User Task Discovery (UTD)
Binary classification approach:
- Given a set of task-annotated queries
- Build a training set of query pairs labeled as same task vs. different tasks
- Define the feature set:
  edlevgt2: edit distance between q_i and q_j
  wordr: Jaccard distance between the sets of words of q_i and q_j
  char_suf: number of common characters in q_i and q_j
  nsubst_q_j_X: related to the probability of q_j being reformulated
  time_diff: inter-query time gap between q_i and q_j
  sequential: binary feature that is positive if q_i and q_j are issued sequentially
  prisma: cosine between the two vectors of the top-50 pages returned by a SE for q_i and q_j
  entropy_q_i_X: measures the rewrite probabilities from q_i
  σ_jaccard_url: Jaccard similarity between the top-20 URLs returned by a SE for q_i and q_j
  σ_wikipedia: cosine based on Wikipedia articles containing q_i and q_j
- Learn a classifier optimizing classification accuracy
- Logistic Regression works sufficiently well
- Decision trees are slightly better
We improved over the Query-Flow-Graph approach
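As a rough illustration of how a feature vector for a query pair might be built, the sketch below computes four simple features in the spirit of the edit-distance, word Jaccard, time-gap and sequentiality features listed on the slide. The function and key names are hypothetical, the exact definitions used in the original work may differ, and the features that require a search engine or Wikipedia are left out.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def word_jaccard_distance(a, b):
    """One minus the Jaccard coefficient of the two word sets."""
    wa, wb = set(a.split()), set(b.split())
    union = wa | wb
    return 1.0 - len(wa & wb) / len(union) if union else 0.0

def pair_features(qi, qj, ti, tj, adjacent):
    """Feature dictionary for a query pair; timestamps are in seconds (an assumption)."""
    return {
        "edit_distance": edit_distance(qi, qj),
        "word_jaccard_distance": word_jaccard_distance(qi, qj),
        "time_diff": abs(tj - ti),
        "sequential": 1 if adjacent else 0,
    }

features = pair_features("rome hotels", "rome hotel", ti=100, tj=160, adjacent=True)
print(features)
```

Vectors like this one, labeled as same task or different task, would form the training set for the classifier.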

15 Learning to Rank for UTD
Learning to rank setting:
- A query is a query q_i in the user session
- A document is any other query q_j in the same user session
- Objective: rank same-task queries higher than different-task queries
Clustering quality:
- Sample of the AOL 2006 query log: ~8,800 queries by 127 users, manually annotated into ~1350 tasks, ~6.5 queries per task
- Table columns: Algo, Rand Index, Jacc. Index, Avg. Fm; rows: Log.Reg, Lambda-MART

17 Entity Linking
The goal is to identify relevant entities mentioned by fragments of text.
Entities are taken from a given catalogue, e.g. Wikipedia.

18 Entity Linking
State-of-the-art approaches run three steps:
1. Spotting
   Given a document, find fragments of text potentially referring to entities, a.k.a. spots
   A common approach is to match anchors in Wikipedia
   Some spots are ambiguous, e.g. Michael Collins
2. Disambiguation
   Given a set of spots in a document, find the correct entity for each spot
   Steps 1 and 2 are sometimes referred to as Word Sense Disambiguation
3. Link Detection
   Given a document, its terms and their senses, decide where to put links, e.g. Bank of London
The main research question is: how to improve the disambiguation step?

19 Saliency Driven Entity Linking
Mentioned entities do not all have the same importance in the given document.
We defined 3 levels of saliency.
We trained a model to rank entities according to their expected saliency.
Table 2: Entity linking and saliency prediction performance. Columns: CoNLL (Rec, Prec, F1); Wikinews (Rec, Prec, F1, NDCG, F top 1). Rows: GBDT-F, GBRT-F, LRC-F, Tagme, Wikiminer, Spotlight, Step GBRT-F l.

20 What's Next
The software we developed is open-source and available at
Endless Applications:
- News stream analysis, for understanding, summarization, sentiment
- Web queries and web documents annotation

22 News Recommendation
How can we recommend the most relevant news to a given user?
- There are a lot of information sources
- Users are different
- Timeliness is crucial!

23 Time is Crucial
Figure: delay between news publication and clicks

24 Time is crucial: Earthquakes and Twitter
There are some arguments about Twitter waves being faster than seismic waves.

25 Time is crucial: News Agencies vs. Twitter
Tweets about Osama Bin Laden's death
Twitter is sometimes faster than other media.

26 Twitter vs. News streams
The main research questions are:
- Streaming analysis of the Twitter stream
- Detect trending topics early in Twitter
- Model the information spreading process in Twitter
- Detect trending topics early in news
- Recommend news to users according to their tastes
Our research question is: Can we use information both from the Twitter stream and from the news stream to provide personalized news recommendation?

27 When is a news article relevant?
Our assumptions. A news article is interesting if:
1. It discusses topics of interest to the user (e.g., computer science)
2. It discusses topics of interest to the social network of the user (e.g., computer science, art exhibitions in Tuscany)
3. It does not fall in the user's interests, but it is of general relevance (e.g., Ukraine crisis)
We need a way to detect topics in text:
- Topics provide a higher-level view
- Topics can bridge the gap between Twitter and news streams
We model the relation between news, tweets and users in terms of the topics/entities they discuss.

28 Tweets and news relatedness
We extracted topics from tweets and from news.
- Let Z be the set of topics. In our case, Z is the set of Wikipedia pages
- Let T(i, j) be the relevance of topic z_j for tweet t_i
- Let N(j, k) be the relevance of topic z_j for news n_k
The product M = TN is used to estimate the relatedness between tweets and news: M(i, k) is the relatedness between the tweet t_i and the news n_k based on the co-cited topics.
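A tiny worked instance of the product M = TN may help to fix the index conventions. The matrices below are toy values invented for the example, and the plain-Python `matmul` helper is only a stand-in for whatever linear algebra library one would use in practice.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must agree"
    return [[sum(row[j] * b[j][k] for j in range(inner))
             for k in range(len(b[0]))]
            for row in a]

# T[i][j]: relevance of topic z_j for tweet t_i (2 tweets, 3 topics; toy values).
T = [[1.0, 0.0, 0.5],
     [0.0, 2.0, 0.0]]

# N[j][k]: relevance of topic z_j for news n_k (3 topics, 2 news articles; toy values).
N = [[0.5, 0.0],
     [0.0, 1.0],
     [1.0, 0.0]]

# M[i][k]: relatedness of tweet t_i and news n_k through the topics they share.
M = matmul(T, N)
print(M)
```

Tweet t_0 shares topics z_0 and z_2 with news n_0, so M[0][0] = 1.0 * 0.5 + 0.5 * 1.0 = 1.0, while it shares no topic with n_1, so M[0][1] = 0.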

29 Content and Social relevance
Content relatedness is based on the user's history of tweets.
- Let the binary matrix A be such that A(u, i) = 1 iff user u tweeted tweet t_i
- Content relatedness is defined by the matrix Γ = AM, where Γ(u, k) is the relevance of news n_k for user u based on the entities mentioned in u's tweets and in n_k
Social relatedness is a function of the tweets in the social network S of the user:
\Sigma = \left( \sum_{i=1}^{d} \sigma^i S^i \right) A M
- σ is a damping factor, d is the max distance in the social network
- Σ(u, k) is the relevance of news n_k for users in the social circles of u
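The sketch below works through content and social relatedness on a three-user toy example. It assumes that social relatedness is the damped sum of the powers S^1 ... S^d of the social matrix, multiplied by A and M, and that S[u][v] = 1 means user u follows user v. The orientation of S, the helper names and all the values are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(row[j] * b[j][k] for j in range(len(b)))
             for k in range(len(b[0]))]
            for row in a]

def social_closure(S, sigma, d):
    """Return sum over i = 1..d of sigma**i * S**i (damped multi-hop reachability)."""
    n = len(S)
    power = [row[:] for row in S]                      # S**1
    total = [[sigma * x for x in row] for row in S]    # sigma * S
    for i in range(2, d + 1):
        power = matmul(power, S)                       # S**i
        total = [[total[r][c] + (sigma ** i) * power[r][c] for c in range(n)]
                 for r in range(n)]
    return total

# A[u][i] = 1 iff user u wrote tweet t_i (3 users, 2 tweets).
A = [[1, 0],
     [0, 1],
     [0, 0]]
# M[i][k]: tweet-news relatedness (toy values).
M = [[1.0, 0.0],
     [0.0, 2.0]]
# S[u][v] = 1 iff user u follows user v: user 0 follows 1, user 1 follows 2.
S = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]

content = matmul(A, M)                                        # Gamma = A M
social = matmul(matmul(social_closure(S, sigma=0.5, d=2), A), M)
print(content)
print(social)
```

User 0 never tweeted about news n_1, so the content score for it is 0, but the followed user 1 did, which gives user 0 a social score of 0.5 * 2.0 = 1.0 for n_1.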

30 Topic popularity over time
Let Z be a (row) vector where Z(j) is the popularity of topic z_j:
Z_\tau = \lambda Z_{\tau-1} + w_T H_T + w_N H_N
Three components contribute to the popularity of a topic:
- The popularity at the previous time-step τ-1 (exponential forgetting, with decay factor λ)
- The estimated popularity in the stream of tweets (first order derivative)
- The estimated popularity in the stream of news (first order derivative)
Weights w_T and w_N are set equal in our experiments.
Topic popularity is defined by the matrix Π = ZN, where Π(k) is the popularity of news article n_k based on the topics it discusses.
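One step of the popularity update can be sketched as below, assuming the update adds the two stream estimates to a decayed copy of the previous popularity vector. The parameter name `decay`, its value, and all the numbers are hypothetical; how the stream estimates H_T and H_N are obtained is outside the scope of this sketch.

```python
def update_popularity(z_prev, h_tweets, h_news, decay=0.5, w_t=1.0, w_n=1.0):
    """One time-step: decay the previous popularity and add the two stream estimates."""
    return [decay * z + w_t * ht + w_n * hn
            for z, ht, hn in zip(z_prev, h_tweets, h_news)]

def news_popularity(z, N):
    """Pi(k) = sum over topics j of Z(j) * N(j, k)."""
    return [sum(z[j] * N[j][k] for j in range(len(z)))
            for k in range(len(N[0]))]

# Popularity of 3 topics at the previous time-step (toy values).
z_prev = [4.0, 0.0, 2.0]
# Estimated popularity of each topic in the tweet and news streams (toy values).
h_tweets = [1.0, 2.0, 0.0]
h_news = [0.0, 1.0, 0.0]

z_now = update_popularity(z_prev, h_tweets, h_news)

# N[j][k]: relevance of topic z_j for news n_k (toy values).
N = [[0.5, 0.0],
     [0.0, 1.0],
     [1.0, 0.0]]
pi = news_popularity(z_now, N)
print(z_now, pi)
```

Topic z_1 was unknown at the previous step but is active in both streams, so it catches up with the decaying topic z_0, and the news article n_1 that discusses it becomes the most popular one.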

31 Learning to rank formulation
We say that the relevance of a news n for a user u is a linear combination of content relevance, social relevance and topic popularity:
R_\tau(u, n) = \alpha \Gamma(u, n) + \beta \Sigma(u, n) + \gamma \Pi(n)
This can be thought of as a ranking problem: given a set of news at time τ, sort them according to their relevance R_τ(u, n) and propose the best to the user u.
- Learn a ranking/relevance function that promotes clicked news
- Find the best α, β, γ such that clicked news are ranked higher than non-clicked ones
- This can be mapped into a Support Vector Machine formulation
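To make the objective concrete, the sketch below scores candidate news with a weighted sum of content, social and popularity scores, ranks them, and measures how often a clicked article outscores a non-clicked one. It does not train anything: it is not the SVM formulation mentioned on the slide, only an evaluation of the pairwise criterion that such a learner would optimize. The weights, scores and names are invented for the example.

```python
def relevance(content, social, popularity, alpha, beta, gamma):
    """Linear combination of the three relevance signals."""
    return alpha * content + beta * social + gamma * popularity

def rank_news(candidates, weights):
    """Sort news ids by decreasing relevance.

    candidates maps a news id to its (content, social, popularity) scores.
    """
    return sorted(candidates,
                  key=lambda n: relevance(*candidates[n], *weights),
                  reverse=True)

def pairwise_accuracy(candidates, clicked, weights):
    """Fraction of (clicked, non-clicked) pairs where the clicked item scores higher."""
    scores = {n: relevance(*f, *weights) for n, f in candidates.items()}
    pairs = [(c, o) for c in clicked for o in candidates if o not in clicked]
    if not pairs:
        return 1.0
    return sum(scores[c] > scores[o] for c, o in pairs) / len(pairs)

# Toy scores for one user at one time-step.
candidates = {
    "n1": (0.9, 0.1, 0.2),
    "n2": (0.1, 0.8, 0.3),
    "n3": (0.0, 0.0, 1.0),
}
clicked = {"n2"}

social_heavy = (0.2, 0.6, 0.2)   # hypothetical (alpha, beta, gamma)
content_only = (1.0, 0.0, 0.0)

ranking = rank_news(candidates, social_heavy)
print(ranking, pairwise_accuracy(candidates, clicked, social_heavy))
print(pairwise_accuracy(candidates, clicked, content_only))
```

With the social-heavy weights the clicked article n2 is ranked first and every pair is ordered correctly. With content-only weights half of the pairs are violated, which is the kind of signal a pairwise learner would use to adjust α, β and γ.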

32 Dataset used
- 1 Million English tweets by 3,214 users (May 2011)
- 40,000 news articles from the Yahoo! News portal containing at least one topic mentioned in the tweet stream
- Yahoo! Toolbar data for clicks on news articles
- We also used toolbar data to link web users with Twitter users, assuming that a user's own Twitter account is the one they visit most (filtering out popular public persons, celebrities, etc.)
The training and test sets are built under the assumption that a clicked news article must be the top ranked at the time of its publication.

33 T.Rex: Twitter-based News Recommendation System
Predicting clicked entities
DCG(N) = \sum_{j=1}^{N} \frac{G[j]}{\log(j + 1)}
where G[j] is the entity intersection with the clicked news (rescaled in 0,...,4)
Figure: Average DCG vs. Rank for T.Rex+, T.Rex, Popularity, Content, Social, Recency and Click count
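The DCG measure used on this slide can be computed in a few lines, assuming the usual form in which the gain at rank j is divided by log(j + 1) and summed over the top N positions. The slide does not state the base of the logarithm; base 2 is assumed here, and the gain values are invented.

```python
import math

def dcg(gains, n):
    """Sum of G[j] / log2(j + 1) over ranks j = 1..n (ranks start at 1)."""
    return sum(g / math.log2(j + 1)
               for j, g in enumerate(gains[:n], start=1))

# Gains of the items at ranks 1, 2, 3 on the 0..4 scale (toy values).
gains = [4, 0, 2]
print(dcg(gains, 1), dcg(gains, 3))
```

For these gains, rank 1 contributes 4 / log2(2) = 4 and rank 3 contributes 2 / log2(4) = 1, so the value at N = 3 is 5. Moving the gain-2 item up to rank 2 would increase the value, which is why the measure rewards placing relevant items early.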

34 What's Next
- The mining of Twitter data requires large-scale tools; everything was implemented with MapReduce
- Streaming data mining tools should be adopted
- More complex/interesting models instead of a linear combination
- Correlation of Twitter streams with other streams, e.g. query logs

36 (credits to David Crandall et al., Cornell University)

37 Research Challenges
The analysis of a large and noisy collection of social geo-tagged photos poses several challenges:
1. Clean and organize the collection in semantically coherent clusters
2. Associate relevant PoIs with these clusters
3. Devise routes of tourists through these PoIs and characterize as precisely as possible the behaviors of tourists
4. Extract and exploit such knowledge
Example tags: pontevecchio, trip, firenze, palazzo vecchio, canon, florence
Our research question is: How to provide personalized PoI recommendations?

38 How much can we understand from photos?

39 Visual Clustering
Goal: to reduce the cost of computing similarity
- H. Bay, T. Tuytelaars, and L. Van Gool. Surf: Speeded up robust features. Computer Vision, ECCV 2006.
- G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. ECCV, 2004.

40 Labeling with tags
Two key ideas:
- Using the spatial relevance of tags
  Measure: ratio between the tag area and the overall geographical area analyzed
- Using the social relevance of tags
  Measure: number of different users using a given tag
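The two measures can be sketched as follows. The slide does not say how the area covered by a tag is computed; this example assumes the axis-aligned bounding box of the photos carrying the tag, on made-up planar coordinates. The function names and the photo data are hypothetical.

```python
def bounding_box_area(points):
    """Area of the axis-aligned bounding box of a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

def tag_statistics(photos):
    """Return tag -> (area ratio, number of distinct users).

    photos is a list of (user, (x, y), tags) triples.
    """
    total_area = bounding_box_area([pos for _, pos, _ in photos])
    by_tag = {}
    for user, pos, tags in photos:
        for tag in tags:
            points, users = by_tag.setdefault(tag, ([], set()))
            points.append(pos)
            users.add(user)
    return {
        tag: (bounding_box_area(points) / total_area if total_area else 0.0,
              len(users))
        for tag, (points, users) in by_tag.items()
    }

# Toy photo collection: (user, position, tags).
photos = [
    ("u1", (0.0, 0.0), ["florence", "pontevecchio"]),
    ("u2", (1.0, 1.0), ["florence", "pontevecchio"]),
    ("u3", (10.0, 10.0), ["florence", "canon"]),
    ("u3", (0.0, 5.0), ["canon"]),
]
stats = tag_statistics(photos)
for tag, (ratio, users) in stats.items():
    print(tag, ratio, users)
```

In this toy data "pontevecchio" covers 1% of the analyzed area and is used by two people, "florence" covers the whole area, and "canon" is spread out but used by a single person.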

41 Clustering and enriching Flickr photos
Example tags: Lucca, Toscana, Vacanze 2011, Tuscany, Italy, Summer 2011, Oval Square, Piazza dell'anfiteatro, Piazza del mercato, square, happy new year, windows, balcony, canon, nikon, ...

42 Enriching older photos

43 Mining Trajectories from Flickr
- Colosseum: 3 photos, 01/07/2013 9:00-12:00
- Ruins: 2 photos, 01/07/ :30-15:00
- Trevi Fountain: 2 photos, 01/07/ :42-16:00
Devise patterns of tourists' behavior...

44 Planning Sightseeing Tours with TripBuilder
What should I visit in San Francisco?
Given:
- Time: 2 days
- My preferences
How do other tourists visit such places?
How many of these trajectories can I enjoy?
Figure: map of PoIs (Golden Gate Bridge, Golden Gate Park, California Academy of Sciences, de Young Museum, San Francisco Museum of Modern Art, Aquarium of the Bay, Alcatraz) with trajectories labeled 4 h, 4 h and 8 h

45 The TripCover Problem
Given:
- A set of popular trajectories crossing a set of PoIs and their time cost
- The relevance of the trajectories w.r.t. the category set
- The Time Budget and Preferences of a user
- A measure of PoI-User interest
Find:
- the subset of trajectories that maximizes user interest and fits in the time budget
TripCover is an instance of the Generalized Maximum Coverage (GMC) problem: NP-hard, with a (e/(e-1))-approximation algorithm.
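To give a feeling for the problem, here is a simple greedy heuristic that repeatedly picks the affordable trajectory with the best ratio between newly gained interest and time cost. It is a simplification written for illustration: it is not the approximation algorithm mentioned on the slide and carries no approximation guarantee. The trajectories, interest values and names are invented.

```python
def greedy_trip_cover(trajectories, poi_interest, time_budget):
    """Greedy selection of trajectories under a time budget.

    trajectories maps a name to (set of PoIs, time cost), with cost > 0.
    poi_interest maps a PoI to the user's interest in it.
    Returns the chosen trajectory names (in pick order) and the covered PoIs.
    """
    chosen, covered, remaining = [], set(), time_budget
    candidates = dict(trajectories)
    while candidates:
        best, best_ratio = None, 0.0
        for name, (pois, cost) in candidates.items():
            if cost > remaining:
                continue  # does not fit in what is left of the budget
            # Only PoIs not yet covered add interest.
            gain = sum(poi_interest.get(p, 0.0) for p in pois - covered)
            ratio = gain / cost
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            break  # nothing affordable adds any interest
        pois, cost = candidates.pop(best)
        chosen.append(best)
        covered |= pois
        remaining -= cost
    return chosen, covered

# Toy instance: time costs in hours, interest on an arbitrary scale.
trajectories = {
    "bridge_park": ({"Golden Gate Bridge", "Golden Gate Park"}, 4.0),
    "museums": ({"de Young Museum", "Golden Gate Park"}, 4.0),
    "bay": ({"Alcatraz", "Aquarium of the Bay"}, 8.0),
}
poi_interest = {
    "Golden Gate Bridge": 3.0, "Golden Gate Park": 2.0, "de Young Museum": 4.0,
    "Alcatraz": 5.0, "Aquarium of the Bay": 1.0,
}
chosen, covered = greedy_trip_cover(trajectories, poi_interest, time_budget=8.0)
print(chosen, sorted(covered))
```

With an 8-hour budget the heuristic first takes "museums" (6 units of interest in 4 hours), then "bridge_park", whose park visit no longer counts because the park is already covered.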

46 TrajSP: joining the trajectories
- The TripCover solution is a set of trajectories fitting user interest and time budget
- Local search heuristics for connecting the solution in a single sightseeing tour
Figure: panels (a)-(c) illustrate the case l(i, k) = 4, panels (d)-(f) the case l(i, k) = 3

48 The End!
"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."