What to Mine from Big Data? Hang Li, Noah's Ark Lab, Huawei Technologies

Big Data Value

Two Main Issues in Big Data Mining

Agenda
- Four Principles for What to Mine
- Stories Regarding the Principles
  - Search and Browse Log Mining as Example
- Our Work on Big Data Mining
  - Mining Query Subtopics from Search Log Data
- Summary

Four Principles for What to Mine
1. Identifying scenarios of mining as much as possible
2. Logging as much data as possible
3. Integrating as much data as possible
4. Understanding data as much as possible

Identifying scenarios of mining as much as possible

Immanuel Kant: "The world as we know it is our interpretation of the observable facts in the light of theories that we ourselves invent."

Example of Bad Design of Toolbar
- A toolbar was developed at a search engine
- It recorded the user's search behavior data
- However, it did not record the time at which the user closed the browser
- As a result, there was no indication of the end of a session
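
When no browser-close event is logged, the end of a session has to be guessed. One common workaround, sketched here as an assumption and not as what the toolbar team actually did, is to cut a session after a fixed period of inactivity. The 30-minute gap and the name `split_sessions` are illustrative only.

```python
from datetime import datetime, timedelta


def split_sessions(timestamps, gap=timedelta(minutes=30)):
    """Group one user's event timestamps into sessions.

    A new session starts whenever the pause since the previous event
    exceeds `gap` (a hypothetical 30-minute threshold by default).
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # inactivity gap: open a new session
    return sessions


# Toy data: two events ten minutes apart, then one nearly two hours later.
events = [
    datetime(2013, 1, 1, 9, 0),
    datetime(2013, 1, 1, 9, 10),
    datetime(2013, 1, 1, 11, 0),
]
sessions = split_sessions(events)
print(len(sessions))
```

A timeout is only a proxy: it cannot distinguish a user who closed the browser from one who simply paused, which is why logging the close event directly would have been preferable.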

Logging as much data as possible

Examples of Useful Log Information
- User moves the mouse on screen (the user may unconsciously rest the mouse on the area of focus): may infer the user's interest on the page
- User scrolls up and down with the mouse: may infer whether the user is serious about the page content (more scrolling suggests more seriousness)
- User clicks on the next page: may infer the user's current focus
- User closes the browser window/tab: may infer the user's current focus

Integrating as much data as possible

Model of User Search Behavior
- Data needs to be collected from different sources (toolbar, search engine log)
- E.g., a toolbar usually does not record search results
- Integrating the data is often challenging
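
As a rough illustration of the integration step, the sketch below attaches search results from a search-engine log to toolbar events. The record layout and the field names (`user_id`, `query`, `results`, `action`) are hypothetical; real logs would need a more careful join key, such as timestamps or a shared request identifier.

```python
def attach_results(toolbar_events, engine_log):
    """Join toolbar events to search-engine records on (user_id, query).

    Events with no matching engine record get results=None, which makes
    the integration gaps visible instead of silently dropping data.
    """
    results_by_key = {
        (rec["user_id"], rec["query"]): rec["results"] for rec in engine_log
    }
    merged = []
    for event in toolbar_events:
        key = (event["user_id"], event["query"])
        merged.append({**event, "results": results_by_key.get(key)})
    return merged


# Toy data with hypothetical field names.
toolbar_events = [
    {"user_id": "u1", "query": "harry shum", "action": "scroll"},
    {"user_id": "u2", "query": "xbox games", "action": "close_tab"},
]
engine_log = [
    {"user_id": "u1", "query": "harry shum",
     "results": ["http://en.wikipedia.org/wiki/harry_shum"]},
]
merged = attach_results(toolbar_events, engine_log)
print(merged[1]["results"])
```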

Understanding data as much as possible

AOL Search Data Leak (2006)
- AOL search data release (20M queries, 650K users, 3 months)
- New York Times article: "A Face Is Exposed for AOL Searcher No. 4417749"
- Queries:
  - "landscapers in Lilburn, Ga"
  - several people with the last name Arnold
  - "homes sold in shadow lake subdivision gwinnett county georgia"
  - "dog that urinates on everything"
  - "60 single men"
- The identified searcher is Thelma Arnold, a widow living in Georgia

Mining Query Subtopics from Search Log Data
Yunhua Hu, Yanan Qian (1), Hang Li, Daxin Jiang, Jian Pei (2), and Qinghua Zheng (1)
Microsoft Research Asia, Beijing, China
(1) SPKLSTN Lab, Xi'an Jiaotong University, China
(2) Simon Fraser University, Burnaby, BC, Canada

Outline
- Introduction
- Our Method
- Experiments
- Conclusion

Demo

Mined Subtopics

Subtopics of Query
- Most queries are ambiguous or multifaceted in web search
  - Harry Shum: Harry Shum Microsoft, Harry Shum Jr
  - XBox: XBox games, XBox homepage, XBox marketplace
- Major senses and facets of a query are its subtopics

Our Work = Automatically Mining Subtopics of Queries from Search Log Data

Phenomenon 1: One Subtopic per Search (OSS)
Query: "Harry Shum"
Multi-clicked URLs (multi-clicks) and their frequency:
- Frequency 50:
    http://research.microsoft.com/en-us/people/hshum
    http://en.wikipedia.org/wiki/harry_shum
    http://www.microsoft.com/presspass/exec/shum/
- Frequency 95:
    http://en.wikipedia.org/wiki/harry_shum,_jr
    http://www.washingtonpost.com/.../vi2011022701183.html
URLs clicked jointly in the same searches tend to represent the same subtopics.
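
A minimal sketch of how multi-click frequencies like those above could be counted from a search log. The input format, a list of `(query, clicked_urls)` pairs with one pair per search, is an assumption and not the format used in the original work.

```python
from collections import Counter


def multi_click_counts(searches):
    """Count how often each set of two or more URLs is clicked in one search.

    `searches` is a list of (query, clicked_urls) pairs, one per search
    (hypothetical log format). Single-click searches are ignored because
    they carry no co-click evidence.
    """
    counts = Counter()
    for query, clicked_urls in searches:
        urls = frozenset(clicked_urls)   # click order does not matter here
        if len(urls) >= 2:
            counts[(query, urls)] += 1
    return counts


wiki = "http://en.wikipedia.org/wiki/harry_shum"
msr = "http://research.microsoft.com/en-us/people/hshum"
wiki_jr = "http://en.wikipedia.org/wiki/harry_shum,_jr"

# Toy log: two searches with the same co-clicked pair, one single click.
searches = [
    ("harry shum", [msr, wiki]),
    ("harry shum", [wiki, msr]),
    ("harry shum", [wiki_jr]),
]
counts = multi_click_counts(searches)
print(counts[("harry shum", frozenset([msr, wiki]))])
```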

Phenomenon 2: Subtopic Clarification by Additional Keyword (SCAK)
Queries and their clicked URLs:
- "Harry Shum":
    http://research.microsoft.com/en-us/people/hshum
    http://en.wikipedia.org/wiki/harry_shum,_jr
    http://en.wikipedia.org/wiki/harry_shum
    http://www.washingtonpost.com/.../vi2011022701183.html
    http://www.microsoft.com/presspass/exec/shum/
- "Microsoft Harry Shum":
    http://research.microsoft.com/en-us/people/hshum
    http://en.wikipedia.org/wiki/harry_shum
    http://www.microsoft.com/presspass/exec/shum/
- "Harry Shum Jr":
    http://en.wikipedia.org/wiki/harry_shum,_jr
    http://www.washingtonpost.com/.../vi2011022701183.html
- "Harry Shum Glee":
    http://en.wikipedia.org/wiki/harry_shum,_jr
    http://www.washingtonpost.com/.../vi2011022701183.html
URLs clicked in searches of the query and its expanded queries tend to represent the same subtopics.

Our Approach
- Mining subtopics of queries by leveraging the two phenomena
- Subtopics of a query are represented by
  - URLs
  - Keywords in expanded queries
Example of subtopics (keywords were shown in bold face on the slide):
- Subtopic 1
    Keywords: harry shum microsoft; harry shum bing; microsoft harry shum
    URLs:
      http://en.wikipedia.org/wiki/harry_shum
      http://research.microsoft.com/en-us/people/hshum/
      http://www.microsoft.com/presspass/exec/shum/
- Subtopic 2
    Keywords: harry shum jr; harry shum glee; harry shum junior
    URLs:
      http://en.wikipedia.org/wiki/harry_shum,_jr.
      http://harryshumjr.com/
      http://www.imdb.com/name/nm1484270/

Flow of Clustering Method

Preprocessing
- Tree structure to index queries (Q+W and W+Q for Q)
- Pruning: only keep expanded queries with URL overlap
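
The two preprocessing steps can be sketched as follows. For brevity the sketch swaps the tree index for a linear scan over a dictionary, so it shows the logic and not the efficient data structure; the input format (`clicks_by_query`, mapping each lowercased query to its set of clicked URLs) and the short URL labels are assumptions.

```python
def expanded_queries(query, clicks_by_query):
    """Find expanded queries of the form Q+W or W+Q and prune them.

    Returns {expanded_query: keyword W}. An expanded query is kept only
    if it shares at least one clicked URL with the original query Q.
    A linear scan stands in for the tree index mentioned on the slide.
    """
    q_tokens = query.split()
    base_urls = clicks_by_query[query]
    kept = {}
    for other, urls in clicks_by_query.items():
        tokens = other.split()
        if len(tokens) <= len(q_tokens):
            continue                                   # not an expansion
        if tokens[:len(q_tokens)] == q_tokens:         # form Q+W
            keyword = " ".join(tokens[len(q_tokens):])
        elif tokens[-len(q_tokens):] == q_tokens:      # form W+Q
            keyword = " ".join(tokens[:-len(q_tokens)])
        else:
            continue
        if base_urls & urls:                           # pruning by URL overlap
            kept[other] = keyword
    return kept


# Toy click data with hypothetical short URL labels.
clicks_by_query = {
    "harry shum": {"msr", "wiki", "wiki_jr"},
    "microsoft harry shum": {"msr", "wiki"},
    "harry shum jr": {"wiki_jr"},
    "harry shum cooking": {"recipes"},   # no URL overlap: pruned
    "harry potter": {"potter"},          # not an expansion of the query
}
kept = expanded_queries("harry shum", clicks_by_query)
print(kept)
```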

Similarity Calculation between URLs
- S1: similarity between URLs based on OSS
- S2: similarity based on SCAK
- S3: similarity between URL tokens
Click counts used for S1 (columns: Multi-Click 1, Multi-Click 2, Multi-Click 3):
    http://en.wikipedia.org/wiki/harry_shum: 4, 3, 0
    http://www.microsoft.com/presspass/exec/shum/: 4, 0, 3
    Similarity matrix of S1: 0.64 between the two URLs; remaining entries N/A
Keyword counts used for S2 (columns: Jr, Glee, Microsoft):
    http://en.wikipedia.org/wiki/harry_shum,_jr: 3, 4, 0
    http://www.imdb.com/name/nm1484270/: 4, 3, 0
    Similarity matrix of S2: 0.96 between the two URLs; remaining entries N/A
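
The slide does not name the similarity measure, but the count vectors and the values 0.64 and 0.96 shown on it are consistent with cosine similarity between each URL's count vector. Treating S1 and S2 as cosine similarities is therefore an assumption; the sketch below only checks that the numbers agree.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# S1: counts over Multi-Click 1..3 for the two URLs on the slide.
s1 = cosine([4, 3, 0], [4, 0, 3])   # 16 / (5 * 5)
# S2: counts over the keywords Jr, Glee, Microsoft.
s2 = cosine([3, 4, 0], [4, 3, 0])   # 24 / (5 * 5)

print(round(s1, 2), round(s2, 2))   # 0.64 0.96
```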

Clustering Algorithm
- Agglomerative clustering algorithm
- Two URLs are similar if their similarity is larger than a threshold
- Each maximal connected subgraph (a group of URLs) represents a subtopic
- The algorithm is efficient and easy to implement
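
One way to realize this step is to link every pair of URLs whose similarity exceeds the threshold and then read off the connected components with a union-find structure. This is a sketch under assumptions: the slide does not say how S1, S2, and S3 are combined, so a single `similarity` function is taken as given, and the threshold of 0.5 in the example is arbitrary.

```python
def cluster_urls(urls, similarity, threshold):
    """Group URLs into connected components of the 'similar' graph.

    Two URLs are linked when similarity(a, b) > threshold; each connected
    component is returned as one cluster (one candidate subtopic).
    """
    parent = {u: u for u in urls}

    def find(u):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for i, a in enumerate(urls):
        for b in urls[i + 1:]:
            if similarity(a, b) > threshold:
                parent[find(a)] = find(b)   # merge the two components

    groups = {}
    for u in urls:
        groups.setdefault(find(u), []).append(u)
    return list(groups.values())


# Toy pairwise similarities with hypothetical short URL labels.
sims = {
    frozenset(("wiki", "presspass")): 0.64,
    frozenset(("wiki_jr", "imdb")): 0.96,
    frozenset(("wiki", "wiki_jr")): 0.10,
}


def sim(a, b):
    return sims.get(frozenset((a, b)), 0.0)


clusters = cluster_urls(["wiki", "presspass", "wiki_jr", "imdb"], sim, 0.5)
print(clusters)
```

The pairwise loop is quadratic in the number of URLs per query, which is usually small enough for this to stay cheap.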

Data Set and Parameter Setting
- One open dataset + two proprietary datasets
- Evaluation metrics: B-cubed precision, recall, and F1
- Parameters manually tuned on 1/3 of DataSetA
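
For readers unfamiliar with the metric, here is a minimal sketch of B-cubed precision, recall, and F1 for hard (non-overlapping) clusterings. Whether the original evaluation used this variant or an extended version for overlapping clusters is not stated, so the hard-assignment form is an assumption.

```python
def b_cubed(predicted, gold):
    """B-cubed precision, recall, and F1 for hard clusterings.

    `predicted` and `gold` map each item to a cluster label. For each item,
    precision is the share of its predicted cluster that also shares its
    gold cluster, and recall is the share of its gold cluster found in its
    predicted cluster; both are averaged over all items.
    """
    items = list(gold)
    p_sum = r_sum = 0.0
    for i in items:
        same_pred = {j for j in items if predicted[j] == predicted[i]}
        same_gold = {j for j in items if gold[j] == gold[i]}
        correct = len(same_pred & same_gold)
        p_sum += correct / len(same_pred)
        r_sum += correct / len(same_gold)
    precision = p_sum / len(items)
    recall = r_sum / len(items)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: gold has clusters {a, b} and {c, d}; the prediction
# wrongly puts c together with a and b.
gold = {"a": 1, "b": 1, "c": 2, "d": 2}
predicted = {"a": "x", "b": "x", "c": "x", "d": "y"}
precision, recall, f1 = b_cubed(predicted, gold)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```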

Evaluation of Subtopic Mining
- Evaluation on different similarity functions
- Evaluation on different types of queries

Application in Search Result Clustering (1)
- Search result clustering approaches
  - Baseline: Wang and Zhai's work in SIGIR '07
  - Our approach: "subtopics of query as seed clusters" + traditional URL clustering
- Evaluation on TREC and DataSetA

Application in Search Result Clustering (2)
- Manual evaluation on DataSetB from various perspectives
- Side-by-side evaluation on DataSetB

Application in Search Results Re-ranking (1)

Application in Search Results Re-ranking (2)

Conclusion
- Discovered two phenomena in search log data to represent query subtopics
- Developed a clustering method for subtopic mining
- Applied the mined subtopics to two tasks: search result clustering and re-ranking

Strength and Limitation of Big Data Mining
- Big data really creates big value
- Importance of insight
- Long tail challenges
- Mining needs knowledge

Summary
- Two Major Issues: What to Mine and How to Mine
- Four Principles for What to Mine
- Stories Regarding the Principles
  - Search and Browse Log Mining as Example
- Our Work on Big Data Mining
  - Mining Query Subtopics from Search Log Data

Thanks! hangli-hl@huawei.com