What to Mine from Big Data?
Hang Li
Noah's Ark Lab, Huawei Technologies
Big Data Value
Two Main Issues in Big Data Mining
Agenda
- Four Principles for What to Mine
- Stories Regarding the Principles
- Search and Browse Log Mining as Example
- Our Work on Big Data Mining
- Mining Query Subtopics from Search Log Data
- Summary
Four Principles for What to Mine
1. Identifying scenarios of mining as much as possible
2. Logging as much data as possible
3. Integrating as much data as possible
4. Understanding data as much as possible
Identifying scenarios of mining as much as possible
Immanuel Kant: "The world as we know it is our interpretation of the observable facts in the light of theories that we ourselves invent."
Example of Bad Design of Toolbar
- A toolbar developed at a search engine
- It recorded the user's search behavior data
- However, it did not record the time at which the user closed the browser
- No indication of end of session
Logging as much data as possible
Examples of Useful Log Information
- User moves mouse on screen (user may unconsciously put mouse on focused area): may infer user's interest on the page
- User uses mouse to scroll up and down: may infer whether user is serious about page content (more scrolling suggests more seriousness)
- User clicks on next page: may infer user's current focus
- User closes browser window/tab: may infer user's current focus
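A minimal sketch of how such events might be recorded, so that signals like scrolling and browser close can be mined later. The record layout and the names `LogEvent`, `EventType`, `scroll_count`, and `has_session_end` are hypothetical, not taken from any actual toolbar.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    # Hypothetical event names covering the examples on the slide.
    MOUSE_MOVE = "mouse_move"
    SCROLL = "scroll"
    NEXT_PAGE = "next_page"
    CLOSE_TAB = "close_tab"


@dataclass(frozen=True)
class LogEvent:
    user_id: str
    timestamp: float  # seconds since epoch
    url: str
    event_type: EventType


def scroll_count(events, url):
    """Count scroll events on a page (more scrolling suggests more seriousness)."""
    return sum(1 for e in events
               if e.url == url and e.event_type is EventType.SCROLL)


def has_session_end(events):
    """True if the log contains an explicit close event marking the end of a session."""
    return any(e.event_type is EventType.CLOSE_TAB for e in events)
```

Recording the close event explicitly is what the toolbar in the earlier bad-design example was missing.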
Integrating as much data as possible
Model of User Search Behavior
- Data needs to be collected from different sources (toolbar, search engine log)
- E.g., toolbar usually does not record search results
- Often challenging to integrate data
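One way the integration could look, as a rough sketch: toolbar events are matched to search engine log records with the same user and query, closest in time. The field names (`user`, `query`, `time`, `results`) and the `max_gap` tolerance are assumptions for illustration only.

```python
def integrate(toolbar_events, search_log, max_gap=5.0):
    """Attach search results from the search engine log to toolbar events.

    A toolbar event is matched to the search log record with the same
    (user, query) that is closest in time, if within max_gap seconds.
    Unmatched events keep results=None.
    """
    by_key = {}
    for rec in search_log:
        by_key.setdefault((rec["user"], rec["query"]), []).append(rec)

    merged = []
    for ev in toolbar_events:
        candidates = by_key.get((ev["user"], ev["query"]), [])
        best = min(candidates,
                   key=lambda r: abs(r["time"] - ev["time"]),
                   default=None)
        if best is not None and abs(best["time"] - ev["time"]) <= max_gap:
            merged.append({**ev, "results": best["results"]})
        else:
            merged.append({**ev, "results": None})
    return merged
```

Events that cannot be matched are kept rather than dropped, since missing matches are themselves a sign of how hard integration is.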
Understanding Data as Much as Possible
AOL Search Data Leak (2006)
- AOL search data release (20M queries, 650K users, 3 months)
- New York Times article: "A Face Is Exposed for AOL Searcher No. 4417749"
- Queries:
  - "landscapers in Lilburn, Ga"
  - several people with the last name Arnold
  - "homes sold in shadow lake subdivision gwinnett county georgia"
  - "dog that urinates on everything"
  - "60 single men"
- Identified searcher is Thelma Arnold, a widow living in Georgia
Mining Query Subtopics from Search Log Data
Yunhua Hu, Yanan Qian (1), Hang Li, Daxin Jiang, Jian Pei (2), and Qinghua Zheng (1)
Microsoft Research Asia, Beijing, China
(1) SPKLSTN Lab, Xi'an Jiaotong University, China
(2) Simon Fraser University, Burnaby, BC, Canada
Outline
- Introduction
- Our Method
- Experiments
- Conclusion
Demo
Mined Subtopics
Subtopics of Query
- Most queries are ambiguous or multifaceted in web search
  - Harry Shum: Harry Shum Microsoft, Harry Shum Jr
  - XBox: XBox games, XBox homepage, XBox marketplace
- Major senses and facets of query (subtopics)
Our Work = Automatically Mining Subtopics of Queries from Search Log Data
Phenomenon 1: One Subtopic per Search (OSS)
Query "Harry Shum": multi-clicked URLs (multi-clicks) and their frequencies
- Frequency 50:
  - http://research.microsoft.com/en-us/people/hshum
  - http://en.wikipedia.org/wiki/harry_shum
  - http://www.microsoft.com/presspass/exec/shum/
- Frequency 95:
  - http://en.wikipedia.org/wiki/harry_shum,_jr
  - http://www.washingtonpost.com/.../vi2011022701183.html
Jointly clicked URLs in the same searches tend to represent the same subtopics.
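A small sketch of how multi-click frequencies like those above could be counted from a search log. The input format (one `(query, clicked_urls)` pair per search) and the function name `count_multi_clicks` are assumptions for illustration.

```python
from collections import Counter


def count_multi_clicks(searches):
    """Count how often each set of jointly clicked URLs occurs per query.

    searches: iterable of (query, clicked_urls) pairs, one per search.
    Only searches with at least two distinct clicked URLs are multi-clicks.
    """
    counts = Counter()
    for query, clicked in searches:
        urls = frozenset(clicked)
        if len(urls) >= 2:
            counts[(query, urls)] += 1
    return counts
```

Using a `frozenset` makes the count independent of click order within a search.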
Phenomenon 2: Subtopic Clarification by Additional Keyword (SCAK)
Queries and their clicked URLs
- "Harry Shum":
  - http://research.microsoft.com/en-us/people/hshum
  - http://en.wikipedia.org/wiki/harry_shum,_jr
  - http://en.wikipedia.org/wiki/harry_shum
  - http://www.washingtonpost.com/.../vi2011022701183.html
  - http://www.microsoft.com/presspass/exec/shum/
- "Microsoft Harry Shum":
  - http://research.microsoft.com/en-us/people/hshum
  - http://en.wikipedia.org/wiki/harry_shum
  - http://www.microsoft.com/presspass/exec/shum/
- "Harry Shum Jr":
  - http://en.wikipedia.org/wiki/harry_shum,_jr
  - http://www.washingtonpost.com/.../vi2011022701183.html
- "Harry Shum Glee":
  - http://en.wikipedia.org/wiki/harry_shum,_jr
  - http://www.washingtonpost.com/.../vi2011022701183.html
URLs clicked in searches of the query and its expanded queries tend to represent the same subtopics.
Our Approach
- Mining subtopics of queries by leveraging the two phenomena
- Subtopics of query are represented by
  - URLs
  - Keywords in expanded queries
Example of subtopics
- Subtopic 1
  - Expanded queries containing the keywords: harry shum microsoft; harry shum bing; microsoft harry shum
  - URLs:
    - http://en.wikipedia.org/wiki/harry_shum
    - http://research.microsoft.com/en-us/people/hshum/
    - http://www.microsoft.com/presspass/exec/shum/
- Subtopic 2
  - Expanded queries containing the keywords: harry shum jr; harry shum glee; harry shum junior
  - URLs:
    - http://en.wikipedia.org/wiki/harry_shum,_jr.
    - http://harryshumjr.com/
    - http://www.imdb.com/name/nm1484270/
Flow of Clustering Method
Preprocessing
- Tree structure to index queries (Q+W and W+Q for Q)
- Pruning: only keep expanded queries with URL overlap
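The preprocessing step can be sketched roughly as follows. This sketch replaces the tree index mentioned on the slide with a plain linear scan for brevity, and assumes the log has already been reduced to a mapping from query string to the set of clicked URLs; the name `expanded_queries` is hypothetical.

```python
def expanded_queries(query, clicked_urls_by_query):
    """Find expanded queries of the form Q+W or W+Q and prune by URL overlap.

    Returns a dict mapping each kept expanded query to its additional
    keyword(s) W. An expanded query is kept only if at least one of its
    clicked URLs was also clicked for the original query Q.
    """
    base_urls = clicked_urls_by_query.get(query, set())
    q_tokens = query.split()
    n = len(q_tokens)
    kept = {}
    for other, urls in clicked_urls_by_query.items():
        tokens = other.split()
        if len(tokens) <= n:
            continue
        if tokens[:n] == q_tokens:        # Q+W
            keyword = " ".join(tokens[n:])
        elif tokens[-n:] == q_tokens:     # W+Q
            keyword = " ".join(tokens[:-n])
        else:
            continue
        if urls & base_urls:              # pruning: require URL overlap
            kept[other] = keyword
    return kept
```

A linear scan is fine for a toy example; a trie over query tokens, as the slide suggests, would avoid scanning every query in a large log.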
Similarity Calculation between URLs
- S1: similarity between URLs based on OSS
- S2: similarity between URLs based on SCAK
- S3: similarity between URL tokens
Example for S1: URL vectors over (Multi-Click 1, Multi-Click 2, Multi-Click 3)
- http://en.wikipedia.org/wiki/harry_shum: (4, 3, 0)
- http://www.microsoft.com/presspass/exec/shum/: (4, 0, 3)
- Entry in the similarity matrix of S1 for this pair: 0.64
Example for S2: URL vectors over keywords (Jr, Glee, Microsoft)
- http://en.wikipedia.org/wiki/harry_shum,_jr: (3, 4, 0)
- http://www.imdb.com/name/nm1484270/: (4, 3, 0)
- Entry in the similarity matrix of S2 for this pair: 0.96
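The slide does not name the similarity function, but cosine similarity over the count vectors reproduces both example values (0.64 and 0.96), so the sketch below assumes it. The function name `cosine` is illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# S1 example: URL vectors over multi-clicks
s1 = cosine([4, 3, 0], [4, 0, 3])   # 16 / (5 * 5)
# S2 example: URL vectors over keywords (Jr, Glee, Microsoft)
s2 = cosine([3, 4, 0], [4, 3, 0])   # 24 / (5 * 5)
```

How S1, S2, and S3 are combined into a single score is not stated on the slide, so it is left out here.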
Clustering Algorithm
- Agglomerative clustering algorithm
- Two URLs are similar if the similarity is larger than a threshold
- Each maximum connected subgraph (a group of URLs) represents a subtopic
- Algorithm is efficient and easy to implement
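Read literally, the description above amounts to finding connected components in a graph whose edges link URL pairs with similarity above the threshold. A minimal sketch using union-find follows; the function name `cluster_urls` and the pairwise `similarity` callback are assumptions, and the actual implementation may differ.

```python
def cluster_urls(urls, similarity, threshold):
    """Group URLs into connected components of the thresholded similarity graph.

    urls: list of URL strings.
    similarity: function (url_a, url_b) -> float.
    Returns a list of sets, one per subtopic.
    """
    parent = {u: u for u in urls}

    def find(u):
        # Find the component representative, with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for i, a in enumerate(urls):
        for b in urls[i + 1:]:
            if similarity(a, b) > threshold:
                parent[find(a)] = find(b)   # merge the two components

    groups = {}
    for u in urls:
        groups.setdefault(find(u), set()).add(u)
    return list(groups.values())
```

This compares all pairs, which is quadratic in the number of URLs per query; that is usually small enough in practice for a single query's clicked URLs.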
Data Set and Parameter Setting
- One open dataset + two proprietary datasets
- Evaluation metric: B-cubed precision, recall, and F1
- Manually tune the parameters on 1/3 of DataSetA
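For reference, a sketch of the standard B-cubed metric for the case where each item belongs to exactly one predicted cluster and one gold class. Whether the evaluation here used this exact variant is an assumption; the function name `b_cubed` is illustrative.

```python
def b_cubed(predicted, gold):
    """B-cubed precision, recall, and F1.

    predicted: dict mapping item -> predicted cluster id.
    gold: dict mapping the same items -> gold class id.
    Per-item precision is the fraction of the item's cluster sharing its
    gold class; per-item recall is the fraction of the item's gold class
    found in its cluster. Both are averaged over all items.
    """
    items = list(predicted)
    p_sum = r_sum = 0.0
    for i in items:
        same_cluster = {j for j in items if predicted[j] == predicted[i]}
        same_class = {j for j in items if gold[j] == gold[i]}
        correct = len(same_cluster & same_class)
        p_sum += correct / len(same_cluster)
        r_sum += correct / len(same_class)
    precision = p_sum / len(items)
    recall = r_sum / len(items)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```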
Evaluation of Subtopic Mining
- Evaluation on different similarity functions
- Evaluation on different types of queries
Application in Search Result Clustering (1)
- Search result clustering approaches
  - Baseline: Wang and Zhai's work in SIGIR '07
  - Our approach: "subtopics of query as seed clusters" + traditional URL clustering
- Evaluation on TREC and DataSetA
Application in Search Result Clustering (2)
- Manual evaluation on DataSetB from various perspectives
- Side-by-side evaluation on DataSetB
Application in Search Results Re-ranking (1)
Application in Search Results Re-ranking (2)
Conclusion
- Discovered two phenomena in search log data to represent query subtopics
- Developed a clustering method for subtopic mining
- Applied the mined subtopics to two tasks: search result clustering and re-ranking
Strength and Limitation of Big Data Mining
- Big data really creates big value
- Importance of insight
- Long tail challenges
- Mining needs knowledge
Summary
- Two Major Issues: What to Mine and How to Mine
- Four Principles for What to Mine
- Stories Regarding the Principles
- Search and Browse Log Mining as Example
- Our Work on Big Data Mining
- Mining Query Subtopics from Search Log Data
Thanks!
hangli-hl@huawei.com