Introduction to Information Retrieval
- Basil Webster
- 8 years ago
Transcription
1 Introduction to Information Retrieval IIR 6&7: Vector Space Model Hinrich Schütze Institute for Natural Language Processing, University of Stuttgart Schütze: Vector space model 1 / 37
2 Models and Methods
1. Boolean model and its limitations (30)
2. Vector space model (30)
3. Probabilistic models (30)
4. Language model-based retrieval (30)
5. Latent semantic indexing (30)
6. Learning to rank (30)
6 Take-away
- tf-idf weighting: quick review of tf-idf weighting.
- The vector space model represents queries and documents in a high-dimensional space.
- Pivot normalization (or pivoted document length normalization): an alternative to cosine normalization that removes a bias inherent in standard length normalization.
7 Outline
1. tf-idf weighting
2. Vector space model
3. Pivot length normalization
8 Binary incidence matrix
[Table: term-document matrix; the cell values are not recoverable.]
- Columns (documents): Anthony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth, ...
- Rows (terms): Anthony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser
Each document is represented as a binary vector $\in \{0,1\}^{|V|}$.
10 Count matrix
[Table: term-document matrix of occurrence counts; the cell values are not recoverable.]
- Columns (documents): Anthony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth, ...
- Rows (terms): Anthony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser
Each document is now represented as a count vector $\in \mathbb{N}^{|V|}$.
18 Term frequency tf
- The term frequency $\mathrm{tf}_{t,d}$ of term $t$ in document $d$ is defined as the number of times that $t$ occurs in $d$.
- We want to rank documents according to query-document matching scores and use tf as a component in these matching scores. But how?
- Raw term frequency is not what we want because:
  - A document with tf = 10 occurrences of the term is more relevant than a document with tf = 1 occurrence of the term.
  - But not 10 times more relevant.
- Relevance does not increase proportionally with term frequency.
22 Instead of raw frequency: Log frequency weighting
- The log frequency weight of term $t$ in $d$ is defined as follows:
$$w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}$$
- $\mathrm{tf}_{t,d} \to w_{t,d}$: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, etc.
- Matching score for a document-query pair: sum over the terms $t$ that occur in both $q$ and $d$:
$$\text{tf-matching-score}(q,d) = \sum_{t \in q \cap d} (1 + \log_{10} \mathrm{tf}_{t,d})$$
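A minimal Python sketch of the log frequency weight and the tf matching score defined above. The function names and the example counts are illustrative assumptions, not part of the slides.

```python
import math


def log_tf_weight(tf: int) -> float:
    """Log frequency weight: 1 + log10(tf) for tf > 0, otherwise 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


def tf_matching_score(query_terms, doc_counts) -> float:
    """Sum the log frequency weights of the terms that occur in both q and d."""
    return sum(log_tf_weight(doc_counts.get(t, 0)) for t in set(query_terms))


# Hypothetical term counts for one document.
doc_counts = {"rich": 10, "poor": 1, "gap": 2}

# Reproduce the tf -> weight examples: 0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2.
for tf in (0, 1, 2, 10):
    print(tf, round(log_tf_weight(tf), 1))

# "tax" does not occur in the document, so it contributes nothing.
print(tf_matching_score(["rich", "poor", "tax"], doc_counts))
```

A term with tf = 10 receives weight 2 rather than 10, so ten occurrences count only twice as much as one.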
24 Frequency in document vs. frequency in collection
In addition to term frequency (the frequency of the term in the document), we also want to use the frequency of the term in the collection for weighting and ranking.
30 idf weight
- $\mathrm{df}_t$ is the document frequency, the number of documents that $t$ occurs in.
- $\mathrm{df}_t$ is an inverse measure of the informativeness of term $t$.
- Inverse document frequency, $\mathrm{idf}_t$, is a direct measure of the informativeness of the term.
- The idf weight of term $t$ is defined as follows ($N$ is the number of documents in the collection):
$$\mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t}$$
- We use $\log (N/\mathrm{df}_t)$ instead of $N/\mathrm{df}_t$ to dampen the effect of idf.
32 Examples for idf
$$\mathrm{idf}_t = \log_{10} \frac{1{,}000{,}000}{\mathrm{df}_t}$$

| term | df_t | idf_t |
|---|---|---|
| calpurnia | 1 | 6 |
| animal | 100 | 4 |
| sunday | 1,000 | 3 |
| fly | 10,000 | 2 |
| under | 100,000 | 1 |
| the | 1,000,000 | 0 |
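The idf examples can be checked with a few lines of Python. This is a sketch; the function name `idf` and the constant `N_DOCS` are illustrative, and the collection size of 1,000,000 is the one used in the formula above.

```python
import math

N_DOCS = 1_000_000  # collection size used in the idf examples


def idf(df: int, n_docs: int = N_DOCS) -> float:
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / df)


# Rare terms get a high idf; a term that occurs in every document gets idf 0.
for term, df in [("calpurnia", 1), ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:10s} df={df:>9,d} idf={idf(df):.0f}")
```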
38 Effect of idf on ranking
- idf gives high weights to rare terms like "arachnocentric".
- idf gives low weights to frequent words like "good", "increase", and "line".
- idf affects the ranking of documents for queries with at least two terms.
- For example, in the query "arachnocentric line", idf weighting increases the relative weight of "arachnocentric" and decreases the relative weight of "line".
- idf has little effect on ranking for one-term queries.
43 Summary: tf-idf weighting
- Assign a tf-idf weight for each term $t$ in each document $d$:
$$w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \cdot \log_{10} \frac{N}{\mathrm{df}_t}$$
- The tf-idf weight ...
  - ... increases with the number of occurrences within a document (term frequency component).
  - ... increases with the rarity of the term in the collection (inverse document frequency component).
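As a sketch, the tf-idf weight from the summary can be written as a single function. The name `tfidf_weight`, the base-10 logarithm for both factors, and the example values are assumptions made for illustration.

```python
import math


def tfidf_weight(tf: int, df: int, n_docs: int) -> float:
    """tf-idf weight: (1 + log10 tf) * log10(N / df), or 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(n_docs / df)


N_DOCS = 1_000_000  # assumed collection size

# More occurrences in the document -> higher weight.
print(tfidf_weight(tf=1, df=1_000, n_docs=N_DOCS))
print(tfidf_weight(tf=10, df=1_000, n_docs=N_DOCS))
# Same tf, but a more common term -> lower weight.
print(tfidf_weight(tf=10, df=100_000, n_docs=N_DOCS))
```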
44 Outline
1. tf-idf weighting
2. Vector space model
3. Pivot length normalization
47 Binary → count → weight matrix
[Table: term-document matrix of tf-idf weights; the cell values are not recoverable.]
- Columns (documents): Anthony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth, ...
- Rows (terms): Anthony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser
Each document is now represented as a real-valued vector of tf-idf weights $\in \mathbb{R}^{|V|}$.
54 Documents as vectors
- Each document is now represented as a real-valued vector of tf-idf weights $\in \mathbb{R}^{|V|}$.
- So we have a $|V|$-dimensional real-valued vector space.
- Terms are axes of the space.
- Documents are points or vectors in this space.
- Very high-dimensional: tens of millions of dimensions when you apply this to web search engines.
- Each vector is very sparse: most entries are zero.
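Because most entries are zero, a document vector is usually stored sparsely. The sketch below keeps only the terms that occur in a document, as a dict from term to tf-idf weight. The toy collection, the whitespace tokenization, and the function name are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse tf-idf vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (1.0 + math.log10(tf)) * math.log10(n_docs / df[term])
            for term, tf in counts.items()
        })
    return vectors


# Hypothetical toy collection; a real vocabulary is far larger.
docs = [
    "caesar brutus caesar".split(),
    "caesar calpurnia".split(),
    "mercy worser".split(),
]
vectors = tfidf_vectors(docs)
vocabulary = sorted({term for doc in docs for term in doc})

print(len(vocabulary))  # number of dimensions (axes) of the space
print(vectors[0])       # only the terms that occur in the document are stored
```

Terms missing from a dict are implicit zeros, so storage grows with document length rather than with vocabulary size.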
60 Queries as vectors
- Key idea 1: do the same for queries: represent them as vectors in the high-dimensional space.
- Key idea 2: rank documents according to their proximity to the query.
  - proximity = similarity
  - proximity ≈ negative distance
- Recall: we're doing this because we want to get away from the you're-either-in-or-out, feast-or-famine Boolean model.
- Instead: rank relevant documents higher than nonrelevant documents.
66 How do we formalize vector space similarity?
- First cut: (negative) distance between two points (= distance between the end points of the two vectors).
- Euclidean distance?
- Euclidean distance is a bad idea ...
- ... because Euclidean distance is large for vectors of different lengths.
68 Why distance is a bad idea
[Figure: a two-dimensional term space with the axes "rich" and "poor", each running from 0 to 1, showing the query q: [rich poor] and the documents d1: "Ranks of starving poets swell", d2: "Rich poor gap grows", and d3: "Record baseball salaries in 2010".]
The Euclidean distance of $q$ and $d_2$ is large although the distribution of terms in the query $q$ and the distribution of terms in the document $d_2$ are very similar.
75 Use angle instead of distance
- Rank documents according to their angle with the query.
- The following two notions are equivalent:
  - Rank documents by the angle between query and document, smallest angle first.
  - Rank documents by cosine(query, document), largest cosine first.
- Cosine is a monotonically decreasing function of the angle on the interval [0°, 180°].
- ⇒ Do the ranking according to cosine.
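The following sketch contrasts the two rankings on a two-dimensional "rich"/"poor" space. The coordinates are hypothetical: d2 points in the same direction as the query but is much longer, as in the rich/poor example. Since cosine decreases as the angle grows, sorting by smallest angle first and by largest cosine first gives the same order.

```python
import math


def euclidean(u, v) -> float:
    """Euclidean distance between the end points of two vectors."""
    return math.dist(u, v)


def angle_deg(u, v) -> float:
    """Angle in degrees between two 2-D vectors with non-negative components."""
    return abs(math.degrees(math.atan2(u[1], u[0]) - math.atan2(v[1], v[0])))


# Hypothetical (rich, poor) coordinates.
q = (1.0, 1.0)
docs = {"d1": (0.2, 1.0), "d2": (4.0, 4.0), "d3": (1.0, 0.1)}

by_distance = sorted(docs, key=lambda name: euclidean(q, docs[name]))
by_angle = sorted(docs, key=lambda name: angle_deg(q, docs[name]))
by_cosine = sorted(
    docs,
    key=lambda name: math.cos(math.radians(angle_deg(q, docs[name]))),
    reverse=True,
)

print(by_distance)  # d2 comes last although it has the same term distribution as q
print(by_angle)     # d2 comes first
print(by_cosine)    # same order as by_angle
```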
82 Cosine similarity between query and document
$$\cos(\vec{q},\vec{d}) = \mathrm{sim}(\vec{q},\vec{d}) = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \sum_{i=1}^{|V|} \frac{q_i}{|\vec{q}|} \cdot \frac{d_i}{|\vec{d}|}$$
- $q_i$ is the tf-idf weight of term $i$ in the query.
- $d_i$ is the tf-idf weight of term $i$ in the document.
- $|\vec{q}|$ and $|\vec{d}|$ are the lengths of $\vec{q}$ and $\vec{d}$.
- This is the cosine similarity of $\vec{q}$ and $\vec{d}$ or, equivalently, the cosine of the angle between $\vec{q}$ and $\vec{d}$.
- cosine similarity = dot product of length-normalized vectors
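A minimal sketch of cosine similarity over sparse vectors stored as dicts from term to weight. The function names, the example weights, and the convention of returning 0 for an empty vector are assumptions.

```python
import math


def length(v) -> float:
    """Euclidean length of a sparse vector (dict: term -> weight)."""
    return math.sqrt(sum(w * w for w in v.values()))


def cosine(q, d) -> float:
    """Cosine similarity: dot product of the length-normalized vectors."""
    denom = length(q) * length(d)
    if denom == 0.0:
        return 0.0  # convention chosen here for empty vectors
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    return dot / denom


# Hypothetical weights.
q = {"rich": 1.0, "poor": 1.0}
d2 = {"rich": 3.0, "poor": 3.0}           # same direction as q, but longer
d3 = {"baseball": 2.0, "salaries": 1.0}   # shares no term with q

print(cosine(q, d2))
print(cosine(q, d3))
```

Only the terms of the query need to be visited for the dot product, which matters because queries are typically much shorter than documents.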
84 Cosine similarity illustrated
[Figure: the vectors v(q), v(d1), v(d2), and v(d3) in the plane spanned by the axes "rich" and "poor", with an angle θ marked between vectors.]
86 Components of tf-idf weighting

Term frequency:
- n (natural): $\mathrm{tf}_{t,d}$
- l (logarithm): $1 + \log(\mathrm{tf}_{t,d})$
- a (augmented): $0.5 + \frac{0.5 \cdot \mathrm{tf}_{t,d}}{\max_t(\mathrm{tf}_{t,d})}$
- b (boolean): 1 if $\mathrm{tf}_{t,d} > 0$, 0 otherwise
- L (log ave): $\frac{1 + \log(\mathrm{tf}_{t,d})}{1 + \log(\mathrm{ave}_{t \in d}(\mathrm{tf}_{t,d}))}$

Document frequency:
- n (no): 1
- t (idf): $\log \frac{N}{\mathrm{df}_t}$
- p (prob idf): $\max\{0, \log \frac{N - \mathrm{df}_t}{\mathrm{df}_t}\}$

Normalization:
- n (none): 1
- c (cosine): $\frac{1}{\sqrt{w_1^2 + w_2^2 + \dots + w_M^2}}$
- u (pivoted unique): $1/u$
- b (byte size): $1/\mathit{CharLength}^{\alpha}$, $\alpha < 1$

The slide highlights the best-known combination of weighting options.
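As an illustration, four of the term frequency components can be written as small functions keyed by their letter. This is a sketch under assumptions: the 0.5 constants in the augmented variant follow the usual SMART definition, the logarithm is taken to base 10, and the function names and counts are made up for the example.

```python
import math


def tf_natural(tf, max_tf):
    """n: the raw term frequency."""
    return float(tf)


def tf_logarithm(tf, max_tf):
    """l: 1 + log(tf) for tf > 0, otherwise 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


def tf_augmented(tf, max_tf):
    """a: tf scaled by the largest tf in the document, then mapped into [0.5, 1]."""
    return 0.5 + 0.5 * tf / max_tf


def tf_boolean(tf, max_tf):
    """b: 1 if the term occurs at all, otherwise 0."""
    return 1.0 if tf > 0 else 0.0


# Letter -> term frequency component.
TF_COMPONENTS = {"n": tf_natural, "l": tf_logarithm, "a": tf_augmented, "b": tf_boolean}

tf, max_tf = 2, 4  # hypothetical counts: this term occurs twice, the most frequent term four times
for letter, component in TF_COMPONENTS.items():
    print(letter, round(component(tf, max_tf), 2))
```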
95 tf-idf example
- We often use different weightings for queries and documents.
- Notation: ddd.qqq
- Example: lnc.ltn
  - document: logarithmic tf, no df weighting, cosine normalization
  - query: logarithmic tf, idf, no normalization
- Isn't it bad to not idf-weight the document?
- Example query: "best car insurance"
- Example document: "car insurance auto insurance"
108 tf-idf example: lnc.ltn
Query: "best car insurance". Document: "car insurance auto insurance".

| word | query tf-raw | query tf-wght | df | idf | query weight | document tf-raw | document tf-wght | document weight | document n'lized | product |
|---|---|---|---|---|---|---|---|---|---|---|
| auto | 0 | 0 | 5000 | 2.3 | 0 | 1 | 1 | 1 | 0.52 | 0 |
| best | 1 | 1 | 50000 | 1.3 | 1.3 | 0 | 0 | 0 | 0 | 0 |
| car | 1 | 1 | 10000 | 2.0 | 2.0 | 1 | 1 | 1 | 0.52 | 1.04 |
| insurance | 1 | 1 | 1000 | 3.0 | 3.0 | 2 | 1.3 | 1.3 | 0.68 | 2.04 |

Key to columns:
- tf-raw: raw (unweighted) term frequency
- tf-wght: logarithmically weighted term frequency
- df: document frequency
- idf: inverse document frequency
- weight: the final weight of the term in the query or document
- n'lized: document weights after cosine normalization
- product: the product of final query weight and final document weight

Cosine normalization of the document: $\sqrt{1^2 + 0^2 + 1^2 + 1.3^2} \approx 1.92$, so $1/1.92 \approx 0.52$ and $1.3/1.92 \approx 0.68$.

Final similarity score between query and document: $\sum_i w_{qi} \cdot w_{di} = 0 + 0 + 1.04 + 2.04 = 3.08$
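The lnc.ltn computation can be sketched in Python as follows. The collection size `N_DOCS` and the document frequencies in `DF` are assumptions of this sketch, as are all variable names; with the normalized document weights rounded to two decimals they lead to the final score of 3.08 quoted above.

```python
import math
from collections import Counter

N_DOCS = 1_000_000  # assumed collection size
DF = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}  # assumed document frequencies

query = "best car insurance".split()
document = "car insurance auto insurance".split()


def log_tf(tf: int) -> float:
    """Logarithmic tf weight: 1 + log10(tf) for tf > 0, otherwise 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


# Query side (ltn): logarithmic tf, idf, no normalization.
query_weights = {
    term: log_tf(tf) * math.log10(N_DOCS / DF[term])
    for term, tf in Counter(query).items()
}

# Document side (lnc): logarithmic tf, no df weighting, cosine normalization.
doc_tf = {term: log_tf(tf) for term, tf in Counter(document).items()}
doc_length = math.sqrt(sum(w * w for w in doc_tf.values()))
doc_weights = {term: w / doc_length for term, w in doc_tf.items()}

# Similarity: sum of products over the query terms.
score = sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())

print({term: round(w, 2) for term, w in doc_weights.items()})
print(round(score, 2))
```

Without intermediate rounding the score comes out at about 3.07. Rounding the normalized document weights to 0.52 and 0.68 first gives 2.0 · 0.52 + 3.0 · 0.68 = 1.04 + 2.04 = 3.08.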
109 Outline
1. tf-idf weighting
2. Vector space model
3. Pivot length normalization
110 A problem for cosine normalization Schütze: Vector space model 30 / 37
111 A problem for cosine normalization Query q: anti-doping rules Beijing 2008 olympics Schütze: Vector space model 30 / 37
112 A problem for cosine normalization Query q: anti-doping rules Beijing 2008 olympics Compare three documents Schütze: Vector space model 30 / 37
113 A problem for cosine normalization Query q: anti-doping rules Beijing 2008 olympics Compare three documents d 1 : a short document on anti-doping rules at 2008 Olympics Schütze: Vector space model 30 / 37
114 A problem for cosine normalization Query q: anti-doping rules Beijing 2008 olympics Compare three documents d 1 : a short document on anti-doping rules at 2008 Olympics d 2 : a long document that consists of a copy of d 1 and 5 other news stories, all on topics different from Olympics/anti-doping Schütze: Vector space model 30 / 37
115 A problem for cosine normalization Query q: anti-doping rules Beijing 2008 olympics Compare three documents d 1 : a short document on anti-doping rules at 2008 Olympics d 2 : a long document that consists of a copy of d 1 and 5 other news stories, all on topics different from Olympics/anti-doping d 3 : a short document on anti-doping rules at the 2004 Athens Olympics Schütze: Vector space model 30 / 37
A problem for cosine normalization
Query q: anti-doping rules Beijing 2008 olympics
Compare three documents:
  d1: a short document on anti-doping rules at 2008 Olympics
  d2: a long document that consists of a copy of d1 and 5 other news stories, all on topics different from Olympics/anti-doping
  d3: a short document on anti-doping rules at the 2004 Athens Olympics
What ranking do we expect in the vector space model?
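One way to explore this question is with a toy computation. The sketch below is only an illustration under strong assumptions: the document texts are invented stand-ins for d1, d2 and d3, the filler tokens for the five unrelated stories are hypothetical placeholders, and raw term counts are used instead of tf-idf weights to keep the arithmetic easy to follow.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


query = Counter("anti-doping rules beijing 2008 olympics".split())

# ASSUMPTION: invented toy texts standing in for the three documents.
d1_tokens = "anti-doping rules beijing 2008 olympics athletes testing".split()
d3_tokens = "anti-doping rules athens 2004 olympics athletes testing".split()
# Hypothetical placeholder tokens for 5 unrelated stories of 7 words each.
filler = [f"story{s}_word{w}" for s in range(5) for w in range(7)]

docs = {
    "d1": Counter(d1_tokens),
    "d2": Counter(d1_tokens + filler),  # copy of d1 plus unrelated stories
    "d3": Counter(d3_tokens),
}

scores = {name: cosine(query, vec) for name, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, {name: round(s, 3) for name, s in scores.items()})
```

In this toy setting d2 ends up below d3, although d2 contains all of d1: the unrelated stories inflate its vector length, so cosine normalization shrinks every matching weight.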
Pivot normalization
Cosine normalization produces weights that are too large for short documents and too small for long documents (on average).
Adjust cosine normalization by a linear adjustment: "turning" the average normalization on the pivot.
Effect: Similarities of short documents with the query decrease; similarities of long documents with the query increase.
This removes the unfair advantage that short documents have.
Singhal's study is also interesting from the point of view of methodology.
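The linear adjustment can be written as a one-line function. The sketch below uses the usual pivoted form, (1 - slope) * pivot + slope * cosine_norm, with a slope below 1; the concrete pivot, slope and document norms are hypothetical values chosen only to make the effect visible, and the function name is an assumption.

```python
def pivoted_norm(cosine_norm, pivot, slope):
    """Rotate the normalization line around the pivot (slope < 1).

    A document whose cosine norm equals the pivot keeps its normalization;
    smaller norms are raised and larger norms are lowered.
    """
    return (1.0 - slope) * pivot + slope * cosine_norm


# ASSUMPTION: illustrative cosine norms (vector lengths) for three documents.
cosine_norms = {"short": 2.0, "average": 5.0, "long": 20.0}
pivot = 5.0   # hypothetical; in practice tied to the average normalization
slope = 0.5   # hypothetical; tuned on a collection in practice

for name, norm in cosine_norms.items():
    new_norm = pivoted_norm(norm, pivot, slope)
    # Term weights are divided by the normalization factor, so a larger
    # factor means smaller weights and a lower similarity with the query.
    print(f"{name}: cosine norm {norm} -> pivoted norm {new_norm}")
```

With these numbers the short document's factor grows from 2.0 to 3.5 (its similarities decrease), the long document's factor drops from 20.0 to 12.5 (its similarities increase), and the document at the pivot is unchanged.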
Predicted and true probability of relevance
[Figure; source: Lillian Lee]
Pivot normalization
[Figure; source: Lillian Lee]
Pivoted normalization: Amit Singhal's experiments
(relevant documents retrieved and (change in) average precision)
Summary: Ranked retrieval in the vector space model
  Represent each document as a weighted tf-idf vector
  Represent the query as a weighted tf-idf vector
  Compute the cosine similarity between the query vector and each document vector
  Alternatively, use pivot normalization
  Rank documents with respect to the query
  Return the top K (e.g., K = 10) to the user
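The steps in this summary can be sketched end to end on a tiny in-memory collection. This is an illustrative sketch only: the three documents are invented, tokenization is plain whitespace splitting, document frequencies are computed from the toy collection itself, and the helper names `tfidf_vector` and `top_k` are hypothetical.

```python
import heapq
import math
from collections import Counter


def tfidf_vector(tokens, df, n_docs):
    """Log-weighted tf times idf, followed by cosine (length) normalization."""
    weights = {}
    for term, tf in Counter(tokens).items():
        if term in df:  # terms unseen in the collection are ignored
            weights[term] = (1.0 + math.log10(tf)) * math.log10(n_docs / df[term])
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()} if length > 0 else {}


def top_k(query, docs, k=10):
    """Score every document against the query and return the k best."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    n_docs = len(docs)
    q_vec = tfidf_vector(query.split(), df, n_docs)
    scored = []
    for doc_id, tokens in tokenized.items():
        d_vec = tfidf_vector(tokens, df, n_docs)
        cosine = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
        scored.append((cosine, doc_id))
    return heapq.nlargest(k, scored)


# ASSUMPTION: an invented three-document collection.
docs = {
    "a": "car insurance auto insurance",
    "b": "best car deals",
    "c": "weather report for today",
}
results = top_k("best car insurance", docs, k=2)
print([(doc_id, round(s, 3)) for s, doc_id in results])
```

A real system would not score every document: it would walk the postings lists of the query terms in an inverted index and keep only the K best candidates, and it could swap the cosine normalization in `tfidf_vector` for a pivoted factor.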
Take-away
  tf-idf weighting: Quick review of tf-idf weighting
  Vector space model represents queries and documents in a high-dimensional space.
  Pivot normalization (or "pivoted document length normalization"): alternative to cosine normalization that removes a bias inherent in standard length normalization
Resources
  Chapters 6 and 7 of Introduction to Information Retrieval
  Resources at
  Gerard Salton (main proponent of vector space model in 70s, 80s, 90s)
  Exploring the similarity space (Moffat and Zobel, 2005)
  Pivot normalization (original paper)