Departamento de Engenharia Informática, Instituto Superior Técnico, 1º Semestre 2007/2008
Outline
Exploiting Text. How is text exploited? Two main directions: extraction and retrieval.
Extraction: entity and relationship (link) extraction; entity resolution/matching; other types of extraction: events, opinions, sentiments. IE bibliography: http://scratchpad.wikia.com/wiki/dblife_bibs
Goals: representation, organization, storage of, and access to information items, in order to provide the user with easy access to information. The emphasis is on information.
Information Retrieval vs. Data Retrieval. Data retrieval: given a specified condition (e.g. documents containing {lab, ethics}), find all items that satisfy the condition. Information retrieval: given a user query, find all items that contain information relevant to the user's needs. However... how do you characterize the user's information need?
Translating the user information need. An example: find all pages containing information on the ethical treatment of animals for medical experiments. The pages should contain references to recent related scientific articles, together with an enumeration of known existing alternatives for different medical fields. (Try this on Google.) Usually this is translated to "ethics animals medical experiments", but is this a convenient translation?
Outline
IR Tasks: document processing (indexing, crawling); query processing; distributed IR; string processing; ... Ad-hoc retrieval; classification; clustering; filtering; question answering; ...
The Process
IR Models. Classic models: Boolean, Vector, Probabilistic. Alternative models: Fuzzy, Extended Boolean, ..., LSI, Neural Networks, ..., Belief Network, Language Models, ...
Outline
Index Terms. In the classic IR models, documents are represented by index terms: full text vs. selected keywords; with structure vs. no structure. Not all terms are equally useful: index terms can be weighted. We assume that terms are mutually independent; this is, of course, a simplification.
An Example Example document I heartily accept the motto, That government is best which governs least ; and I should like to see it acted up to more rapidly and systematically. Carried out, it finally amounts to this, which also I believe That government is best which governs not at all ; and when men are prepared for it, that will be the kind of government which they will have.
An Example Index terms I accept acted all also amounts and are at be believe best carried finally for government governs have heartily is it kind least like men more motto not of out prepared rapidly see should systematically that the they this to up when which will
An Example Index terms I 3 accept 1 acted 2 all 3 also 1 amounts 1 and 3 are 1 at 1 be 1 believe 1 best 2 carried 1 finally 1 for 1 government 3 governs 2 have 1 heartily 1 is 2 it 3 kind 1 least 1 like 1 men 1 more 1 motto 1 not 1 of 1 out 1 prepared 1 rapidly 1 see 1 should 1 systematically 1 that 3 the 2 they 1 this 1 to 3 up 1 when 1 which 4 will 2
An Example. Logical view of the documents:

       accept  acted  all  ...  government  governs  ...
d1       1       2     3   ...      3          2     ...
d2       0       1     0   ...      2          2     ...
d3       0       2     0   ...      1          0     ...
d4       2       0     2   ...      2          1     ...
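The logical view above can be built mechanically from raw text. A minimal Python sketch (the function name and the naive whitespace tokenization are illustrative assumptions, not from the slides):

```python
from collections import Counter

def term_document_matrix(documents):
    """Build the logical view: one row of term frequencies per document."""
    # Naive whitespace tokenization; real systems use proper lexical analysis.
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

vocabulary, matrix = term_document_matrix([
    "that government is best which governs least",
    "that government is best which governs not at all",
])
```

Each row of `matrix` is the frequency vector of one document over the shared vocabulary, exactly the table shown above.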
Documents as Vectors. Documents are represented as vectors: d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}), where w_{i,j} is the weight of term i in document j. Queries are also vectors: q = (w_{1,q}, w_{2,q}, ..., w_{t,q}). Vector operations can be used to compare queries to documents (or documents to documents).
An example. Suppose the vocabulary has two terms, k_1 = men and k_2 = government. Two documents, d_1 and d_2, can be defined as, for instance: d_1 = (2.2, 5.2), d_2 = (4.9, 1.0).
[Figure: d_1 = (2.2, 5.2) and d_2 = (4.9, 1.0) plotted in the plane with axis k_1 = men and axis k_2 = government.]
Defining Document Vectors Two questions are still unanswered: 1 How do we define term weights? 2 How do we compare documents to queries?
Defining Term Weights: TF (term frequency). Term frequency is a measure of term importance within a document. Definition: let N be the total number of documents in the system and n_i be the number of documents in which term k_i appears. The normalized frequency of a term k_i in document d_j is given by f_{i,j} = freq_{i,j} / max_l freq_{l,j}, where freq_{i,j} is the number of occurrences of term k_i in document d_j.
Defining Term Weights: IDF (inverse document frequency). Document frequency is a measure of term importance within a collection. Definition: the inverse document frequency of a term k_i is given by idf_i = log(N / n_i).
Defining Term Weights: TF-IDF. Definition: the weight of a term k_i in document d_j for the vector space model is given by the tf-idf formula: w_{i,j} = f_{i,j} × log(N / n_i).
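Putting the two definitions together, the tf-idf weights of a small collection can be computed as follows. This is a sketch of the formulas above; function and variable names are made up for illustration:

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """w_{i,j} = f_{i,j} * log(N / n_i), with f_{i,j} the normalized frequency."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    N = len(documents)
    # n_i: number of documents in which term k_i appears
    n = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        max_freq = max(c.values())            # max_l freq_{l,j}
        weights.append({term: (freq / max_freq) * math.log(N / n[term])
                        for term, freq in c.items()})
    return weights

w = tfidf_weights(["men and government", "government of men", "best government"])
```

Note that a term appearing in every document (here "government", n_i = N) gets weight zero: it cannot discriminate between documents.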
Document Similarity. Similarity between documents and queries is a measure of the correlation between their vectors. Documents/queries that share the same terms, with similar weights, should be more similar. Thus, as a similarity measure, we use the cosine of the angle between the vectors:

sim(d_j, q) = (d_j · q) / (|d_j| |q|) = Σ_{i=1}^{t} w_{i,j} w_{i,q} / ( sqrt(Σ_{i=1}^{t} w_{i,j}²) · sqrt(Σ_{i=1}^{t} w_{i,q}²) )
An example. [Figure: the vectors d_1, d_2 and a query q plotted in the men/government plane, with cos(α) = 0.9 for the angle α between q and d_1, and cos(θ) = 0.8 for the angle θ between q and d_2.]
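The ranking in the 2-D example can be reproduced in a few lines of Python. The query vector q = (1, 1) is an assumption chosen for illustration (the slides do not give the query's weights):

```python
import math

def cosine(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norms = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(x * x for x in q))
    return dot / norms if norms else 0.0

d1 = (2.2, 5.2)          # document vectors from the example
d2 = (4.9, 1.0)
q = (1.0, 1.0)           # illustrative query weighting both terms equally
ranked = sorted([("d1", cosine(d1, q)), ("d2", cosine(d2, q))],
                key=lambda pair: pair[1], reverse=True)
```

With this query, d1 scores higher than d2 and is therefore ranked first.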
Outline
Traditional IR vs. Web IR. Traditional IR systems: the worth of a document regarding a query is intrinsic to the document; documents are self-contained units; documents are descriptive and truthful. The World Wide Web: indefinitely growing; non-textual content; documents are not self-complete; no coherence of style, vocabulary, language, ...; most web queries are about 2 words long.
Web IR: more information to explore: multimedia (images, video, sound); (semi-)structured content; hyperlinks.
Hyperlink graph analysis. Hypermedia is a social network. Social network theory: extensive research in applying graph notions: centrality and prestige; co-citation (relevance judgment). Applications: Web search (HITS, Google), classification and topic distillation.
Ranking Through Link Analysis. Ranking search results. Problems: keyword queries are not selective enough; documents do not have enough text. Solution: use graph notions of popularity/prestige, e.g., the PageRank algorithm.
Outline
The link graph. Each page is a node without any textual properties. Each hyperlink is an edge connecting two nodes, possibly with a positive edge weight as its only property.
PageRank: two perspectives. The prestige of a page is proportional to the sum of the prestige scores of the pages linking to it. Equivalently, the idea of a random surfer on a strongly connected web graph.
Overview of PageRank. Pre-computes a rank vector that provides a-priori (offline) importance estimates for all pages on the Web, independent of the search query. In-degree prestige: not all votes are worth the same; the prestige of a page depends on the prestige of the citing pages. The query-independent prestige score is pre-computed; at query time, prestige scores are used in conjunction with query-specific IR scores.
The PageRank algorithm. E is the adjacency matrix of the Web: E[u, v] = 1 if there is a link from u to v, and 0 otherwise. The out-degree of node u is given by N_u = Σ_v E[u, v]. Start with an initial prestige vector p_0[u] and compute p_{i+1}[v] = Σ_{(u,v)∈E} p_i[u] / N_u.
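The iteration above, sketched in Python over adjacency lists rather than the full matrix E (the graph and the names are illustrative; this plain version assumes a strongly connected, aperiodic graph, which the next slides show is not true of the real Web):

```python
def pagerank_iterate(out_links, iterations=50):
    """Plain power iteration: p_{i+1}[v] = sum over edges (u, v) of p_i[u] / N_u."""
    nodes = list(out_links)
    p = {u: 1.0 / len(nodes) for u in nodes}   # initial prestige vector p_0
    for _ in range(iterations):
        nxt = {v: 0.0 for v in nodes}
        for u, targets in out_links.items():
            share = p[u] / len(targets)        # p_i[u] / N_u
            for v in targets:
                nxt[v] += share
        p = nxt
    return p

# Small strongly connected, aperiodic example graph.
p = pagerank_iterate({"a": ["b"], "b": ["c"], "c": ["a", "b"]})
```

Since each step only redistributes prestige along edges, the scores always sum to 1, and on this graph "b" (cited by two pages) ends up above "a" (cited only by half of c's vote).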
Computing PageRank
Problems of convergence: the Web graph is not strongly connected (only about a fourth of it is!); the graph is not aperiodic; rank sinks: pages without out-links and directed cyclic paths.
A simple fix: a two-way choice at each node. With a certain probability d (0.1 < d < 0.2), the surfer jumps to a random page on the Web; with probability 1 − d the surfer chooses, uniformly at random, an out-neighbor: p_{i+1}[v] = d / N + (1 − d) Σ_{(u,v)∈E} p_i[u] / N_u
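The fixed iteration in code, extending the plain power iteration with the random jump (again a sketch with illustrative names; it assumes every node still has at least one out-link, since rank sinks need separate handling):

```python
def pagerank(out_links, d=0.15, iterations=50):
    """Power iteration with the random-jump fix:
    p_{i+1}[v] = d / N + (1 - d) * sum over edges (u, v) of p_i[u] / N_u
    """
    nodes = list(out_links)
    N = len(nodes)
    p = {u: 1.0 / N for u in nodes}
    for _ in range(iterations):
        nxt = {v: d / N for v in nodes}        # random jump, probability d
        for u, targets in out_links.items():
            share = (1 - d) * p[u] / len(targets)
            for v in targets:
                nxt[v] += share
        p = nxt
    return p

p = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```

On a symmetric 3-cycle the scores stay uniform at 1/3, but unlike the plain version this iteration also converges on graphs that are periodic or not strongly connected, because the random jump makes the underlying chain irreducible and aperiodic.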
PageRank architecture at Google. The ranking of pages matters more than the exact values of p. Convergence of page ranks in 52 iterations for a crawl with 322 million links. The PageRank of each page is pre-computed and stored, independent of any query or textual content. The ranking scheme combines PageRank with textual match. Unpublished: many empirical parameters, human effort, and regression testing.
Questions?