What to Mine from Big Data?
Hang Li
Noah's Ark Lab, Huawei Technologies

Big Data Value

Two Main Issues in Big Data Mining

Agenda
- Four Principles for What to Mine
- Stories Regarding the Principles
- Search and Browse Log Mining as Example
- Our Work on Big Data Mining
- Mining Query Subtopics from Search Log Data
- Summary

Four Principles for What to Mine
1. Identifying scenarios of mining as much as possible
2. Logging as much data as possible
3. Integrating as much data as possible
4. Understanding data as much as possible

Identifying scenarios of mining as much as possible

Immanuel Kant: "The world as we know it is our interpretation of the observable facts in the light of theories that we ourselves invent."

Example of Bad Design of Toolbar
- A toolbar developed at a search engine recorded the user's search behavior data
- However, it did not record the time at which the user closed the browser
- As a result, the log gave no indication of the end of a session

Logging as much data as possible

Examples of Useful Log Information
- User moves the mouse on the screen (the user may unconsciously put the mouse on the focused area): may infer the user's interest on the page
- User scrolls up and down with the mouse: may infer whether the user is serious about the page content (more scrolling suggests more seriousness)
- User clicks on the next page: may infer the user's current focus
- User closes the browser window/tab: may infer the user's current focus

Integrating as much data as possible

Model of User Search Behavior
- Data needs to be collected from different sources (toolbar, search engine log)
- E.g., the toolbar usually does not record search results
- Integrating the data is often challenging

Understanding Data as Much as Possible

AOL Search Data Leak (2006)
- AOL search data release (20M queries, 650K users, 3 months)
- New York Times article: "A Face Is Exposed for AOL Searcher No. 4417749"
- Queries:
  - "landscapers in Lilburn, Ga"
  - several people with the last name Arnold
  - "homes sold in shadow lake subdivision gwinnett county georgia"
  - "dog that urinates on everything"
  - "60 single men"
- The identified searcher is Thelma Arnold, a widow living in Georgia

Mining Query Subtopics from Search Log Data
Yunhua Hu, Yanan Qian (1), Hang Li, Daxin Jiang, Jian Pei (2), and Qinghua Zheng (1)
Microsoft Research Asia, Beijing, China
(1) SPKLSTN Lab, Xi'an Jiaotong University, China
(2) Simon Fraser University, Burnaby, BC, Canada

Outline
- Introduction
- Our Method
- Experiments
- Conclusion

Mined Subtopics

Subtopics of Query
- Most queries in web search are ambiguous or multifaceted
- Query "Harry Shum": Harry Shum Microsoft, Harry Shum Jr
- Query "XBox": XBox games, XBox homepage, XBox marketplace
- Subtopics are the major senses and facets of a query

Our Work = Automatically Mining Subtopics of Queries from Search Log Data

Phenomenon 1: One Subtopic per Search (OSS)
Query: "Harry Shum"
Multi-clicked URLs (multi-clicks) and their frequencies:
- Frequency 50:
  "http://research.microsoft.com/en-us/people/hshum"
  "http://en.wikipedia.org/wiki/harry_shum"
  "http://www.microsoft.com/presspass/exec/shum/"
- Frequency 95:
  "http://en.wikipedia.org/wiki/harry_shum,_jr"
  "http://www.washingtonpost.com/.../vi2011022701183.html"
URLs jointly clicked in the same search tend to represent the same subtopic.
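The multi-click frequencies above could be gathered from a log along the following lines. This is a minimal sketch, not the authors' implementation: the log format (one `(query, clicked_urls)` pair per search) and the function name `multi_click_counts` are assumptions made for illustration.

```python
from collections import Counter


def multi_click_counts(searches):
    """Count how often each set of jointly clicked URLs occurs per query.

    `searches` is an assumed log format: an iterable of
    (query, clicked_urls) pairs, one pair per search.
    """
    counts = Counter()
    for query, clicked_urls in searches:
        url_set = frozenset(clicked_urls)  # click order is ignored
        if len(url_set) >= 2:              # a multi-click needs at least two URLs
            counts[(query, url_set)] += 1
    return counts


# Toy log built from URLs shown on the slide (frequencies are made up).
searches = [
    ("harry shum", ["http://research.microsoft.com/en-us/people/hshum",
                    "http://en.wikipedia.org/wiki/harry_shum"]),
    ("harry shum", ["http://en.wikipedia.org/wiki/harry_shum",
                    "http://research.microsoft.com/en-us/people/hshum"]),
    ("harry shum", ["http://en.wikipedia.org/wiki/harry_shum,_jr"]),
]
counts = multi_click_counts(searches)
```

Searches with a single click carry no co-click evidence, so the sketch drops them.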

Phenomenon 2: Subtopic Clarification by Additional Keyword (SCAK)
Clicked URLs per query:
- "Harry Shum":
  "http://research.microsoft.com/en-us/people/hshum"
  "http://en.wikipedia.org/wiki/harry_shum,_jr"
  "http://en.wikipedia.org/wiki/harry_shum"
  "http://www.washingtonpost.com/.../vi2011022701183.html"
  "http://www.microsoft.com/presspass/exec/shum/"
- "Microsoft Harry Shum":
  "http://research.microsoft.com/en-us/people/hshum"
  "http://en.wikipedia.org/wiki/harry_shum"
  "http://www.microsoft.com/presspass/exec/shum/"
- "Harry Shum Jr":
  "http://en.wikipedia.org/wiki/harry_shum,_jr"
  "http://www.washingtonpost.com/.../vi2011022701183.html"
- "Harry Shum Glee":
  "http://en.wikipedia.org/wiki/harry_shum,_jr"
  "http://www.washingtonpost.com/.../vi2011022701183.html"
URLs clicked in searches of the query and of its expanded queries tend to represent the same subtopics.
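One way to turn SCAK into data is to record, for every URL, how often it was clicked under each additional keyword. The sketch below assumes a click log of `(query_string, url)` records and treats a logged query as an expansion when it starts or ends with the original query; the function name and log format are hypothetical.

```python
from collections import defaultdict


def keyword_click_vectors(query, click_log):
    """Map each URL to {additional keyword: click count} for one query.

    `click_log` is an assumed format: an iterable of (query_string, url)
    click records. A logged query counts as an expansion of `query` when
    it has the form Q+W or W+Q.
    """
    q_tokens = query.lower().split()
    n = len(q_tokens)
    vectors = defaultdict(lambda: defaultdict(int))
    for logged_query, url in click_log:
        tokens = logged_query.lower().split()
        if len(tokens) <= n:
            continue                          # not longer than Q, so no keyword
        if tokens[:n] == q_tokens:            # Q+W
            keyword = " ".join(tokens[n:])
        elif tokens[-n:] == q_tokens:         # W+Q
            keyword = " ".join(tokens[:-n])
        else:
            continue
        vectors[url][keyword] += 1
    return {url: dict(kw) for url, kw in vectors.items()}


click_log = [
    ("Harry Shum Jr", "http://en.wikipedia.org/wiki/harry_shum,_jr"),
    ("Harry Shum Glee", "http://en.wikipedia.org/wiki/harry_shum,_jr"),
    ("Microsoft Harry Shum", "http://research.microsoft.com/en-us/people/hshum"),
    ("Harry Shum", "http://en.wikipedia.org/wiki/harry_shum"),
]
vectors = keyword_click_vectors("Harry Shum", click_log)
```

URLs whose keyword vectors look alike ("jr", "glee") are then candidates for the same subtopic.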

Our Approach
- Mine subtopics of queries by leveraging the two phenomena
- A subtopic of a query is represented by
  - URLs
  - Keywords in expanded queries
Example of subtopics:
- Subtopic 1
  Keywords: harry shum microsoft, harry shum bing, microsoft harry shum
  URLs:
  http://en.wikipedia.org/wiki/harry_shum
  http://research.microsoft.com/en-us/people/hshum/
  http://www.microsoft.com/presspass/exec/shum/
- Subtopic 2
  Keywords: harry shum jr, harry shum glee, harry shum junior
  URLs:
  http://en.wikipedia.org/wiki/harry_shum,_jr.
  http://harryshumjr.com/
  http://www.imdb.com/name/nm1484270/

Flow of Clustering Method

Preprocessing
- Tree structure to index queries (Q+W and W+Q for query Q)
- Pruning: only keep expanded queries with URL overlap
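The slide does not say which tree structure is used. The sketch below assumes a token-level prefix tree: one tree over the query tokens finds Q+W, a second tree over the reversed tokens finds W+Q, and an expanded query is kept only if it shares at least one clicked URL with Q. The names `TokenTrie` and `expanded_queries` and the input format are hypothetical.

```python
class TokenTrie:
    """Prefix tree over query tokens; each node stores the queries ending there."""

    def __init__(self):
        self.children = {}
        self.queries = []

    def insert(self, tokens, query):
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, TokenTrie())
        node.queries.append(query)

    def strict_extensions(self, tokens):
        """Return stored queries that extend `tokens` by at least one token."""
        node = self
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                return []
        found, stack = [], list(node.children.values())
        while stack:
            cur = stack.pop()
            found.extend(cur.queries)
            stack.extend(cur.children.values())
        return found


def expanded_queries(query, clicks_by_query):
    """clicks_by_query (assumed format): dict of query string -> set of clicked URLs."""
    forward, backward = TokenTrie(), TokenTrie()
    for q in clicks_by_query:
        toks = q.split()
        forward.insert(toks, q)            # lookup of Q+W
        backward.insert(toks[::-1], q)     # lookup of W+Q
    q_toks = query.split()
    candidates = set(forward.strict_extensions(q_toks))
    candidates |= set(backward.strict_extensions(q_toks[::-1]))
    base_urls = clicks_by_query.get(query, set())
    # Pruning: keep only expanded queries that share a clicked URL with Q.
    return sorted(c for c in candidates if clicks_by_query[c] & base_urls)


clicks = {
    "harry shum": {"u1", "u2"},
    "harry shum jr": {"u2"},
    "microsoft harry shum": {"u1"},
    "harry shum dance": {"u9"},   # expansion without URL overlap, pruned
    "harry potter": {"u1"},       # not an expansion of the query
}
kept = expanded_queries("harry shum", clicks)
```

In a real system the two trees would be built once over the whole log rather than per call; they are built inside the function here only to keep the example self-contained.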

Similarity Calculation between URLs
- S1: similarity between URLs based on OSS
- S2: similarity between URLs based on SCAK
- S3: similarity between URL tokens

Click counts per multi-click (input to S1):
URL                                              Multi-Click 1  Multi-Click 2  Multi-Click 3
"http://en.wikipedia.org/wiki/harry_shum"        4              3              0
"http://www.microsoft.com/presspass/exec/shum/"  4              0              3
Similarity matrix of S1: 0.64 (remaining entries N/A)

Click counts per additional keyword (input to S2):
URL                                              Jr  Glee  Microsoft
"http://en.wikipedia.org/wiki/harry_shum,_jr"    3   4     0
"http://www.imdb.com/name/nm1484270/"            4   3     0
Similarity matrix of S2: 0.96 (remaining entries N/A)
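The slide does not name the similarity function. The values 0.64 and 0.96 are, however, exactly what cosine similarity gives on the count vectors shown, so the sketch below assumes cosine similarity for S1 and S2. The function name `cosine` is chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# S1: counts over Multi-Click 1..3 for the two Microsoft-related URLs.
s1 = cosine([4, 3, 0], [4, 0, 3])   # 16 / (5 * 5)

# S2: counts over the keywords Jr, Glee, Microsoft for the two Jr-related URLs.
s2 = cosine([3, 4, 0], [4, 3, 0])   # 24 / (5 * 5)
```

With these inputs `s1` evaluates to 0.64 and `s2` to 0.96, matching the matrix entries on the slide.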

Clustering Algorithm
- Agglomerative clustering algorithm
- Two URLs are considered similar if their similarity is larger than a threshold
- Each maximal connected subgraph (a group of URLs) represents a subtopic
- The algorithm is efficient and easy to implement
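Read literally, the slide describes linking every pair of URLs whose similarity exceeds a threshold and taking each connected component as a subtopic, which is what single-link agglomerative clustering cut at that threshold produces. The sketch below implements that reading with a union-find structure. The function name, the toy labels, and the threshold value 0.5 are assumptions, not values from the talk.

```python
def cluster_urls(urls, similarity, threshold):
    """Group URLs into connected components of the 'similar' graph.

    `similarity` is any function taking two URLs and returning a number;
    two URLs are linked when its value is larger than `threshold`.
    """
    parent = {u: u for u in urls}

    def find(u):
        # Follow parent links to the component root, halving paths on the way.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for i, a in enumerate(urls):
        for b in urls[i + 1:]:
            if similarity(a, b) > threshold:
                parent[find(a)] = find(b)   # merge the two components

    groups = {}
    for u in urls:
        groups.setdefault(find(u), []).append(u)
    return list(groups.values())


# Toy similarities: "a"/"b" stand for the Microsoft URLs, "c"/"d" for the Jr URLs.
sims = {frozenset(("a", "b")): 0.64, frozenset(("c", "d")): 0.96}
clusters = cluster_urls(
    ["a", "b", "c", "d"],
    lambda x, y: sims.get(frozenset((x, y)), 0.0),
    threshold=0.5,  # hypothetical value
)
```

The pairwise loop is quadratic in the number of URLs, which is acceptable when clustering is done per query over that query's clicked URLs.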

Data Set and Parameter Setting
- One open dataset + two proprietary datasets
- Evaluation metric: B-cubed precision, recall, and F1
- Parameters are tuned manually on one third of DataSetA
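For reference, B-cubed precision and recall are averages over items: for each item, precision is the fraction of its predicted cluster that shares its gold label, and recall is the fraction of its gold cluster that shares its predicted cluster. The sketch below follows that standard definition; the function name and the dict-based input format are assumptions, and the toy labels are not from the talk's datasets.

```python
def b_cubed(predicted, gold):
    """Return (precision, recall, F1) for two clusterings of the same items.

    `predicted` and `gold` map each item to a cluster label.
    """
    items = list(gold)
    precision = recall = 0.0
    for i in items:
        same_pred = [j for j in items if predicted[j] == predicted[i]]
        same_gold = [j for j in items if gold[j] == gold[i]]
        # Items that share both the predicted and the gold cluster with i.
        correct = sum(1 for j in same_pred if gold[j] == gold[i])
        precision += correct / len(same_pred)
        recall += correct / len(same_gold)
    precision /= len(items)
    recall /= len(items)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = {"a": 1, "b": 1, "c": 2, "d": 2}
predicted = {"a": "x", "b": "x", "c": "x", "d": "y"}
p, r, f = b_cubed(predicted, gold)
```

On this toy input the predicted clustering merges "c" into the wrong group, giving precision 2/3, recall 3/4, and F1 12/17.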

Evaluation of Subtopic Mining
- Evaluation on different similarity functions
- Evaluation on different types of queries

Application in Search Result Clustering (1)
- Search result clustering approaches
  - Baseline: Wang and Zhai's work in SIGIR '07
  - Our approach: "subtopics of query as seed clusters" + traditional URL clustering
- Evaluation on TREC and DataSetA

Application in Search Result Clustering (2)
- Manual evaluation on DataSetB from various perspectives
- Side-by-side evaluation on DataSetB

Application in Search Results Re-ranking (1)

Application in Search Results Re-ranking (2)

Conclusion
- Discovered two phenomena in search log data that can be used to represent query subtopics
- Developed a clustering method for subtopic mining
- Applied the mined subtopics to two tasks: search result clustering and re-ranking

Strength and Limitation of Big Data Mining
- Big data really creates big value
- Importance of insight
- Long tail challenges
- Mining needs knowledge

Summary
- Two Major Issues: What to Mine and How to Mine
- Four Principles for What to Mine
- Stories Regarding the Principles
- Search and Browse Log Mining as Example
- Our Work on Big Data Mining
- Mining Query Subtopics from Search Log Data

Thanks!
hangli-hl@huawei.com