1st Semester 2007/2008



Departamento de Engenharia Informática, Instituto Superior Técnico. 1st Semester 2007/2008

Outline 1 2 3 4 5

Exploiting Text
How is text exploited? Two main directions:
- Information Retrieval
- Information Extraction

Information Extraction
- Entity and relationship (link) extraction
- Entity resolution/matching
- Other types of extraction: events, opinions, sentiments
- IE bibliography: http://scratchpad.wikia.com/wiki/dblife_bibs

Information Retrieval
- Goals: representation, organization, storage, and access to information items, in order to provide the user with easy access to information
- The emphasis is on information

Information Retrieval vs. Data Retrieval
- Data retrieval: given a specified condition (e.g. {lab, ethics} ⊆ document), find all items that satisfy the condition
- Information retrieval: given a user query, find all items that contain information relevant to the user's needs
However... how do you characterize the user's information need?

Translating the user information need
An example:
"Find all pages containing information on the ethical treatment of animals for medical experiments. The pages should contain references to recent related scientific articles, together with an enumeration of known existing alternatives for different medical fields." (Try this on Google.)
Usually this is translated to: ethics animals medical experiments
...but is this a convenient translation?

IR Tasks
- Document processing: indexing, crawling, query processing, distributed IR, string processing, ...
- ... processing: ad-hoc retrieval, classification, clustering, filtering, question answering, ...

The Process

IR Models
- Classic models: Boolean, Vector, Probabilistic
- Alternative models: Fuzzy, Extended Boolean, ...; LSI, Neural Networks, ...; Belief Network, Language Models, ...

Index Terms
- In the classic IR models, documents are represented by index terms (full text or selected keywords; with or without structure)
- Not all terms are equally useful, so index terms can be weighted
- We assume that terms are mutually independent; this is, of course, a simplification

An Example
Example document:
I heartily accept the motto, "That government is best which governs least"; and I should like to see it acted up to more rapidly and systematically. Carried out, it finally amounts to this, which also I believe, "That government is best which governs not at all"; and when men are prepared for it, that will be the kind of government which they will have.

An Example
Index terms: I, accept, acted, all, also, amounts, and, are, at, be, believe, best, carried, finally, for, government, governs, have, heartily, is, it, kind, least, like, men, more, motto, not, of, out, prepared, rapidly, see, should, systematically, that, the, they, this, to, up, when, which, will

An Example
Index terms with their frequencies: I 3, accept 1, acted 2, all 3, also 1, amounts 1, and 3, are 1, at 1, be 1, believe 1, best 2, carried 1, finally 1, for 1, government 3, governs 2, have 1, heartily 1, is 2, it 3, kind 1, least 1, like 1, men 1, more 1, motto 1, not 1, of 1, out 1, prepared 1, rapidly 1, see 1, should 1, systematically 1, that 3, the 2, they 1, this 1, to 3, up 1, when 1, which 4, will 2
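Counting index terms is mechanical, so a short sketch may help. The following Python fragment assumes a very simple tokenization (lowercase the text, keep runs of letters), which the slides do not specify; the names `passage` and `term_counts` are illustrative only.

```python
import re
from collections import Counter

# Example passage from the slides.
passage = (
    'I heartily accept the motto, "That government is best which governs '
    'least"; and I should like to see it acted up to more rapidly and '
    'systematically. Carried out, it finally amounts to this, which also I '
    'believe, "That government is best which governs not at all"; and when '
    'men are prepared for it, that will be the kind of government which '
    'they will have.'
)

# Assumed tokenization: lowercase, then keep maximal runs of letters.
tokens = re.findall(r"[a-z]+", passage.lower())

# Map each index term to its number of occurrences in the passage.
term_counts = Counter(tokens)

for term in ("government", "governs", "which", "men"):
    print(term, term_counts[term])
```

A real system would usually also remove stop words and apply stemming before counting; both steps are left out here.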

An Example
Logical view of the documents:

      accept  acted  all  ...  government  governs  ...
d1    1       2      3    ...  3           2        ...
d2    0       1      0    ...  2           2        ...
d3    0       2      0    ...  1           0        ...
d4    2       0      2    ...  2           1        ...
...

Documents as Vectors
- Documents are represented as vectors: $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$, where $w_{i,j}$ is the weight of term $i$ in document $j$
- Queries are also vectors: $q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
- Vector operations can be used to compare queries with documents (or documents with documents)

An example
Suppose the vocabulary has two terms, $k_1$ = men and $k_2$ = government. Two documents, $d_1$ and $d_2$, can be defined as, for instance, $d_1 = (2.2, 5.2)$ and $d_2 = (4.9, 1.0)$.

An example
[Figure: the vectors $d_1 = (2.2, 5.2)$ and $d_2 = (4.9, 1.0)$ plotted on the axes "men" and "government".]

Defining Document Vectors
Two questions are still unanswered:
1. How do we define term weights?
2. How do we compare documents to queries?

Defining Term Weights: TF (term frequency)
Term frequency is a measure of term importance within a document.
Definition: Let $N$ be the total number of documents in the system and $n_i$ be the number of documents in which term $k_i$ appears. The normalized frequency of a term $k_i$ in document $d_j$ is given by
$$f_{i,j} = \frac{\mathrm{freq}_{i,j}}{\max_l \mathrm{freq}_{l,j}}$$
where $\mathrm{freq}_{i,j}$ is the number of occurrences of term $k_i$ in document $d_j$.
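As an illustration of the normalized frequency, here is a minimal Python sketch. The function name `normalized_tf` and the dictionary layout are assumptions made for this example. The raw counts are a few of the frequencies listed for the example document, including "which", whose count of 4 is the largest in that document.

```python
def normalized_tf(counts):
    """Return f_ij = freq_ij / max_l freq_lj for a single document.

    `counts` maps each term to its raw number of occurrences.
    """
    max_freq = max(counts.values())
    return {term: freq / max_freq for term, freq in counts.items()}


# A few raw frequencies from the example document's index terms.
raw_counts = {"government": 3, "governs": 2, "which": 4, "men": 1}

tf = normalized_tf(raw_counts)
print(tf)
```

Dividing by the maximum frequency keeps every value in the interval (0, 1], so long documents do not dominate merely because they repeat terms more often.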

Defining Term Weights: IDF (inverse document frequency)
Document frequency is a measure of term importance within a collection.
Definition: The inverse document frequency of a term $k_i$ is given by
$$\mathrm{idf}_i = \log \frac{N}{n_i}$$

Defining Term Weights: TF-IDF
Definition: The weight of a term $k_i$ in document $d_j$ for the vector space model is given by the tf-idf formula:
$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$$
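The tf-idf weighting can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy collection is hypothetical, the function name `tf_idf_vectors` is invented for the example, and the natural logarithm is used because the slides do not fix the base.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Compute w_ij = f_ij * log(N / n_i) for a list of tokenized documents."""
    n_docs = len(docs)  # N: total number of documents
    # n_i: number of documents in which each term appears
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        max_freq = max(counts.values())
        vectors.append({
            term: (freq / max_freq) * math.log(n_docs / doc_freq[term])
            for term, freq in counts.items()
        })
    return vectors


# Hypothetical toy collection (not from the slides).
docs = [
    ["men", "government", "government"],
    ["men", "law"],
]

weights = tf_idf_vectors(docs)
print(weights)
```

Note that "men" occurs in every document of this toy collection, so its idf, and therefore its weight, is zero: a term that appears everywhere does not help to discriminate between documents.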

Document Similarity
- Similarity between documents and queries is a measure of the correlation between their vectors
- Documents/queries that share the same terms, with similar weights, should be more similar
- Thus, as a similarity measure, we use the cosine of the angle between the vectors:
$$\mathrm{sim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$
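A minimal Python sketch of the cosine measure follows, applied to the two example document vectors $d_1 = (2.2, 5.2)$ and $d_2 = (4.9, 1.0)$. The query vector `q` is a hypothetical choice made for this illustration, and the function assumes that neither vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


d1 = (2.2, 5.2)  # document vectors from the example (men, government)
d2 = (4.9, 1.0)
q = (1.0, 1.0)   # hypothetical query weighting both terms equally

# Rank the documents by decreasing similarity to the query.
ranking = sorted(
    [("d1", cosine_similarity(d1, q)), ("d2", cosine_similarity(d2, q))],
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

Because the cosine depends only on the direction of the vectors, scaling a document vector by a constant does not change its similarity to the query.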

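An example
[Figure: the vectors $d_1$, $q$ and $d_2$ on the axes "men" and "government", with the angles $\alpha$ and $\theta$ marked; $\cos(\alpha) = 0.9$, $\cos(\theta) = 0.8$.]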

Traditional IR vs. Web IR
Traditional IR systems:
- The worth of a document regarding a query is intrinsic to the document
- Documents are self-contained units
- Documents are descriptive and truthful
The World Wide Web:
- Indefinitely growing
- Non-textual content
- Documents are not self-complete
- No coherence of style, vocabulary, language, ...
- Most web queries are 2 words long

Web IR
More information to explore:
- Multimedia: images, video, sound
- (Semi-)structured content
- Hyperlinks

Hyperlink graph analysis
- Hypermedia is a social network
- Social network theory: extensive research in applying graph notions, such as centrality and prestige, and co-citation (relevance judgment)
- Applications: Web search (HITS, Google); classification and topic distillation

Ranking Through Link Analysis
Ranking search results
- Problems: keyword queries are not selective enough; documents do not have enough text
- Solution: use graph notions of popularity/prestige, e.g., use the PageRank algorithm

Link
- Each page is a node without any textual properties
- Each hyperlink is an edge connecting two nodes, possibly with only a positive edge weight property

Two perspectives
- The prestige of a page is proportional to the sum of the prestige scores of the pages linking to it
- Idea of a random surfer on a strongly connected web graph

Overview of PageRank
- Pre-computes a rank vector: provides a-priori (offline) importance estimates for all pages on the Web, independent of the search query
- In-degree prestige: not all votes are worth the same; the prestige of a page depends on the prestige of the citing pages
- Pre-compute a query-independent prestige score
- Query time: prestige scores are used in conjunction with query-specific IR scores

The PageRank algorithm
- $E$ is the adjacency matrix of the Web:
$$E[u, v] = \begin{cases} 1 & \text{if there is a link from } u \text{ to } v \\ 0 & \text{otherwise} \end{cases}$$
- The out-degree of node $u$ is given by $N_u = \sum_v E[u, v]$
- Start with an initial prestige vector $p_0[u]$
- Compute
$$p_{i+1}[v] = \sum_{(u,v) \in E} \frac{p_i[u]}{N_u}$$
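The iteration above can be sketched in Python as follows. This is a minimal illustration, not a production implementation: the graph `web` is hypothetical, the adjacency-list representation and the name `prestige_iteration` are assumptions of the example, and every node is assumed to have at least one out-link and to appear as a key.

```python
def prestige_iteration(links, iterations=100):
    """Repeat p_{i+1}[v] = sum over links (u, v) of p_i[u] / N_u.

    `links` maps each node u to the list of nodes it links to.
    """
    nodes = list(links)
    # Initial prestige vector p_0: uniform over all nodes.
    rank = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in nodes}
        for u, targets in links.items():
            share = rank[u] / len(targets)  # p_i[u] / N_u
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical graph, strongly connected and aperiodic, so the iteration converges.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

prestige = prestige_iteration(web)
print(prestige)
```

On this small graph the scores settle at 0.4 for A and C and 0.2 for B: B receives only half of A's vote, while C collects the other half plus all of B's.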

Computing

Problems of Convergence
- The Web graph is not strongly connected (only a fourth of the graph is!)
- The Web graph is not aperiodic
- Rank sinks: pages without out-links, directed cyclic paths

A simple fix
Two-way choice at each node:
- With a certain probability $d$ ($0.1 < d < 0.2$), the surfer jumps to a random page on the Web
- With probability $1 - d$, the surfer chooses, uniformly at random, an out-neighbor
$$p_{i+1}[v] = \frac{d}{N} + (1 - d) \sum_{(u,v) \in E} \frac{p_i[u]}{N_u}$$
where $N$ is the total number of pages.
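The random-jump fix can be sketched by changing one line of the basic iteration. The following Python fragment is an illustration under assumptions: the graph `web` is hypothetical, `d = 0.15` is just one value inside the stated range, and every node is assumed to have at least one out-link (pages without out-links would need extra handling that is not shown here).

```python
def pagerank(links, d=0.15, iterations=200):
    """Repeat p_{i+1}[v] = d / N + (1 - d) * sum over links (u, v) of p_i[u] / N_u.

    `links` maps each node u to the list of nodes it links to;
    `d` is the probability of jumping to a random page.
    """
    nodes = list(links)
    n = len(nodes)  # N: total number of pages
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every page receives the random-jump share d / N.
        new_rank = {v: d / n for v in nodes}
        for u, targets in links.items():
            share = (1 - d) * rank[u] / len(targets)
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical graph that is neither strongly connected (nothing links to C)
# nor aperiodic (A and B form a cycle of length 2).
web = {"A": ["B"], "B": ["A"], "C": ["A"]}

ranks = pagerank(web)
print(ranks)
```

Without the random jump, the scores on this graph would oscillate between A and B forever. With it, the iteration converges, and C, which has no in-links, keeps only the random-jump share $d / N = 0.05$.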

PageRank architecture at Google
- The ranking of pages is more important than the exact values of $p$
- Convergence of page ranks in 52 iterations for a crawl with 322 million links
- Pre-compute and store the PageRank of each page; PageRank is independent of any query or textual content
- The ranking scheme combines PageRank with textual match: unpublished; many empirical parameters, human effort and regression testing

Questions?