Introduction to Information Retrieval


IIR 6&7: Vector Space Model
Hinrich Schütze, Institute for Natural Language Processing, University of Stuttgart
http://informationretrieval.org

Models and Methods

1. Boolean model and its limitations (30)
2. Vector space model (30)
3. Probabilistic models (30)
4. Language model-based retrieval (30)
5. Latent semantic indexing (30)
6. Learning to rank (30)

Take-away

tf-idf weighting: quick review of tf-idf weighting.
The vector space model represents queries and documents in a high-dimensional space.
Pivot normalization (or "pivoted document length normalization"): an alternative to cosine normalization that removes a bias inherent in standard length normalization.

Outline

1. tf-idf weighting
2. Vector space model
3. Pivot length normalization

Binary incidence matrix

                Anthony &  Julius   The      Hamlet  Othello  Macbeth  ...
                Cleopatra  Caesar   Tempest
    Anthony         1         1        0        0        0        1
    Brutus          1         1        0        1        0        0
    Caesar          1         1        0        1        1        1
    Calpurnia       0         1        0        0        0        0
    Cleopatra       1         0        0        0        0        0
    mercy           1         0        1        1        1        1
    worser          1         0        1        1        1        0

Each document is represented as a binary vector ∈ {0,1}^{|V|}.

Count matrix

(Same rows and columns as the incidence matrix above; each entry is now the number of occurrences of the term in the play.)

Each document is now represented as a count vector ∈ N^{|V|}.

Term frequency tf

The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
We want to rank documents according to query-document matching scores and use tf as a component in these matching scores. But how?
Raw term frequency is not what we want: a document with tf = 10 occurrences of the term is more relevant than a document with tf = 1 occurrence, but not 10 times more relevant. Relevance does not increase proportionally with term frequency.

Instead of raw frequency: Log frequency weighting

The log frequency weight of term t in d is defined as follows:

    w_{t,d} = 1 + log_{10}(tf_{t,d})  if tf_{t,d} > 0,  and 0 otherwise

Examples (tf_{t,d} → w_{t,d}): 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

Matching score for a document-query pair: sum over terms t occurring in both q and d:

    tf-matching-score(q, d) = Σ_{t ∈ q ∩ d} (1 + log_{10}(tf_{t,d}))
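A minimal Python sketch of this weighting and matching score (the function names are my own, not from the slides):

```python
import math

def log_tf_weight(tf: int) -> float:
    """Log frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

def tf_matching_score(query_terms: set, doc_tf: dict) -> float:
    """Sum the log-tf weights of the terms shared by query and document."""
    return sum(log_tf_weight(doc_tf[t]) for t in query_terms if t in doc_tf)

# A document containing "insurance" 3 times matches one of the two query terms:
print(tf_matching_score({"best", "insurance"}, {"insurance": 3, "auto": 1}))
# -> 1 + log10(3) ≈ 1.48
```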

Frequency in document vs. frequency in collection

In addition to term frequency (the frequency of the term in the document), we also want to use the frequency of the term in the collection for weighting and ranking.

idf weight

df_t is the document frequency: the number of documents that t occurs in. df_t is an inverse measure of the informativeness of term t; inverse document frequency, idf_t, is a direct measure of the informativeness of the term. The idf weight of term t is defined as follows:

    idf_t = log_{10}(N / df_t)

where N is the number of documents in the collection. We use log_{10}(N / df_t) instead of N / df_t to dampen the effect of idf.

Examples for idf

With N = 1,000,000 documents, idf_t = log_{10}(1,000,000 / df_t):

    term             df_t    idf_t
    calpurnia           1      6
    animal            100      4
    sunday          1,000      3
    fly            10,000      2
    under         100,000      1
    the         1,000,000      0
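The same numbers, reproduced with a few lines of Python (a sketch; the helper name is mine):

```python
import math

def idf(N: int, df_t: int) -> float:
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(N / df_t)

N = 1_000_000
for term, df_t in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                   ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:<9}  df={df_t:>9,}  idf={idf(N, df_t):.0f}")
```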

Effect of idf on ranking

idf gives high weights to rare terms like "arachnocentric" and low weights to frequent words like "good", "increase", and "line". idf affects the ranking of documents for queries with at least two terms. For example, in the query "arachnocentric line", idf weighting increases the relative weight of "arachnocentric" and decreases the relative weight of "line". idf has little effect on ranking for one-term queries.

Summary: tf-idf weighting

Assign a tf-idf weight to each term t in each document d:

    w_{t,d} = (1 + log_{10}(tf_{t,d})) · log_{10}(N / df_t)

The tf-idf weight increases with the number of occurrences within a document (term frequency component) and increases with the rarity of the term in the collection (inverse document frequency component).
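Putting the two components together (again a sketch with my own function name):

```python
import math

def tfidf(tf: int, df_t: int, N: int) -> float:
    """tf-idf weight: (1 + log10(tf)) * log10(N / df_t); 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(N / df_t)

# A term occurring 10 times in the document and in 100 of 1,000,000 documents:
print(tfidf(10, 100, 1_000_000))  # (1 + 1) * 4 = 8.0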

Outline

1. tf-idf weighting
2. Vector space model
3. Pivot length normalization

Binary incidence matrix (as shown earlier): each document is represented as a binary vector ∈ {0,1}^{|V|}.

Count matrix (as shown earlier): each document is now represented as a count vector ∈ N^{|V|}.

Binary → count → weight matrix

(Same rows and columns as the earlier matrices; the entries are now tf-idf weights.)

Each document is now represented as a real-valued vector of tf-idf weights ∈ R^{|V|}.

Documents as vectors

Each document is now represented as a real-valued vector of tf-idf weights ∈ R^{|V|}. So we have a |V|-dimensional real-valued vector space. Terms are the axes of the space; documents are points or vectors in this space. The space is very high-dimensional: tens of millions of dimensions when you apply this to web search engines. Each vector is very sparse: most entries are zero.
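To make the sparsity concrete, a common in-memory representation stores only the nonzero entries (the terms and weights below are illustrative, not taken from the slides):

```python
# Sparse tf-idf vector: store only the nonzero entries out of |V| dimensions.
doc_vector: dict[str, float] = {
    "anti-doping": 4.1,
    "olympics": 2.7,
    "rules": 1.9,
}
# Any vocabulary term absent from the map implicitly has weight 0.
```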

Queries as vectors

Key idea 1: do the same for queries: represent them as vectors in the same high-dimensional space.
Key idea 2: rank documents according to their proximity to the query, where proximity = similarity ≈ negative distance.
Recall: we're doing this because we want to get away from the you're-either-in-or-out, feast-or-famine Boolean model. Instead: rank relevant documents higher than nonrelevant documents.

How do we formalize vector space similarity?

First cut: (negative) distance between two points, i.e., the distance between the end points of the two vectors. Euclidean distance? Euclidean distance is a bad idea, because Euclidean distance is large for vectors of different lengths.

Why distance is a bad idea

(Figure: a two-dimensional term space with axes "rich" and "poor", showing q: [rich poor], d1: "Ranks of starving poets swell", d2: "Rich poor gap grows", and d3: "Record baseball salaries in 2010".)

The Euclidean distance of q and d2 is large, although the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
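A small numerical illustration of this point, with made-up coordinates in the spirit of the figure (the slide's exact coordinates are not recoverable):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

q  = (0.5, 0.5)   # query [rich poor]: both terms equally weighted
d2 = (4.0, 4.5)   # a long document with nearly the same term distribution
print(euclidean(q, d2))  # ≈ 5.31: large, so q and d2 look dissimilar
print(cosine(q, d2))     # ≈ 1.00: the angle between them is tiny
```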

Use angle instead of distance

Rank documents according to the angle between document and query. The following two notions are equivalent:
- rank documents in decreasing order of the angle between query and document;
- rank documents in increasing order of cosine(query, document).
Cosine is a monotonically decreasing function of the angle on the interval [0°, 180°], so we do the ranking according to cosine.

Cosine similarity between query and document

    cos(q, d) = sim(q, d) = (q · d) / (|q| |d|)
              = Σ_{i=1}^{|V|} q_i d_i / ( sqrt(Σ_{i=1}^{|V|} q_i²) · sqrt(Σ_{i=1}^{|V|} d_i²) )

q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of term i in the document. |q| and |d| are the lengths of q and d. This is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d. Cosine similarity is the dot product of the length-normalized vectors.
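A direct sketch of this formula over sparse term → weight vectors (the dict representation shown earlier):

```python
import math

def cosine_sim(q: dict, d: dict) -> float:
    """Cosine similarity of two sparse term -> weight vectors."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

print(cosine_sim({"rich": 1.0, "poor": 1.0}, {"rich": 4.0, "poor": 4.5}))  # ≈ 1.0
```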

Cosine similarity illustrated

(Figure: the length-normalized vectors v(d1), v(q), v(d2), v(d3) in the rich/poor plane, with θ marking the angle between query and document vectors.)

Components of tf-idf weighting

Term frequency:
    n (natural)     tf_{t,d}
    l (logarithm)   1 + log(tf_{t,d})
    a (augmented)   0.5 + 0.5 · tf_{t,d} / max_t(tf_{t,d})
    b (boolean)     1 if tf_{t,d} > 0, else 0
    L (log ave)     (1 + log(tf_{t,d})) / (1 + log(ave_{t∈d}(tf_{t,d})))

Document frequency:
    n (no)          1
    t (idf)         log(N / df_t)
    p (prob idf)    max{0, log((N − df_t) / df_t)}

Normalization:
    n (none)            1
    c (cosine)          1 / sqrt(w_1² + w_2² + … + w_M²)
    u (pivoted unique)  1 / u
    b (byte size)       1 / CharLength^α, α < 1

(The original slide highlights the best-known combination of these weighting options.)

tf-idf example

We often use different weightings for queries and documents. Notation: ddd.qqq. Example: lnc.ltn.
- document: logarithmic tf, no df weighting, cosine normalization
- query: logarithmic tf, idf, no normalization
Isn't it bad not to idf-weight the document? Example query: "best car insurance". Example document: "car insurance auto insurance".

tf-idf example: lnc.ltn

Query: "best car insurance". Document: "car insurance auto insurance". (N = 1,000,000)

    word       |           query                      |        document                | product
               | tf-raw tf-wght    df     idf  weight | tf-raw tf-wght weight  n'lized |
    auto       |   0      0       5,000   2.3   0     |   1      1       1      0.52   |  0
    best       |   1      1      50,000   1.3   1.3   |   0      0       0      0      |  0
    car        |   1      1      10,000   2.0   2.0   |   1      1       1      0.52   |  1.04
    insurance  |   1      1       1,000   3.0   3.0   |   2      1.3     1.3    0.68   |  2.04

Key to columns: tf-raw: raw (unweighted) term frequency; tf-wght: logarithmically weighted term frequency; df: document frequency; idf: inverse document frequency; weight: the final weight of the term in the query or document; n'lized: document weights after cosine normalization (the document length is sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92); product: the product of final query weight and final document weight.

Final similarity score between query and document: Σ_i w_{q,i} · w_{d,i} = 0 + 0 + 1.04 + 2.04 = 3.08
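The computation, replayed in Python (df values as in the table; exact arithmetic gives 3.07 because the slide rounds intermediate values):

```python
import math

def log_tf(tf: int) -> float:
    return 1 + math.log10(tf) if tf > 0 else 0.0

N = 1_000_000
df   = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}
q_tf = {"best": 1, "car": 1, "insurance": 1}
d_tf = {"auto": 1, "car": 1, "insurance": 2}

# ltn query: log tf, idf, no normalization
q_w = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in q_tf.items()}
# lnc document: log tf, no df weighting, cosine normalization
d_raw = {t: log_tf(tf) for t, tf in d_tf.items()}
norm  = math.sqrt(sum(w * w for w in d_raw.values()))   # ≈ 1.92
d_w   = {t: w / norm for t, w in d_raw.items()}

score = sum(w * d_w.get(t, 0.0) for t, w in q_w.items())
print(round(score, 2))  # 3.07 (slide rounds intermediates to get 3.08)
```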

Outline

1. tf-idf weighting
2. Vector space model
3. Pivot length normalization

A problem for cosine normalization

Query q: [anti-doping rules Beijing 2008 olympics]
Compare three documents:
- d1: a short document on anti-doping rules at the 2008 Olympics
- d2: a long document that consists of a copy of d1 and 5 other news stories, all on topics different from Olympics/anti-doping
- d3: a short document on anti-doping rules at the 2004 Athens Olympics
What ranking do we expect in the vector space model?

Pivot normalization

Cosine normalization produces weights that are too large for short documents and too small for long documents (on average). Pivot normalization adjusts cosine normalization by a linear transformation: "turning" the average normalization on the pivot. Effect: similarities of short documents with the query decrease; similarities of long documents with the query increase. This removes the unfair advantage that short documents have. Singhal's study is also interesting from the point of view of methodology.
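A sketch of the linear adjustment, assuming the standard form from Singhal et al.'s pivoted-normalization paper (slope and pivot are collection-tuned parameters; the values here are illustrative):

```python
def pivoted_norm(cosine_norm: float, pivot: float, slope: float = 0.75) -> float:
    """Linear adjustment of the cosine norm around the pivot.
    Short documents (norm < pivot) get a larger divisor, lowering their
    similarity; long documents (norm > pivot) get a smaller one, raising it."""
    return (1.0 - slope) * pivot + slope * cosine_norm

# The pivot is typically the average cosine norm over the collection;
# term weights are then divided by pivoted_norm(|d|, pivot) instead of |d|.
print(pivoted_norm(1.0, pivot=2.0))  # 1.25 > 1.0: short document penalized
print(pivoted_norm(4.0, pivot=2.0))  # 3.5  < 4.0: long document boosted
```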

Predicted and true probability of relevance

(Figure: predicted vs. true probability of relevance as a function of document length; source: Lillian Lee.)

Pivot normalization

(Figure: pivoted normalization vs. standard cosine normalization; source: Lillian Lee.)

Pivoted normalization: Amit Singhal's experiments

(Table: relevant documents retrieved and (change in) average precision.)

Summary: Ranked retrieval in the vector space model

- Represent each document as a weighted tf-idf vector.
- Represent the query as a weighted tf-idf vector.
- Compute the cosine similarity between the query vector and each document vector (alternatively, use pivot normalization).
- Rank documents with respect to the query.
- Return the top K (e.g., K = 10) to the user.

Take-away

tf-idf weighting: quick review of tf-idf weighting.
The vector space model represents queries and documents in a high-dimensional space.
Pivot normalization (or "pivoted document length normalization"): an alternative to cosine normalization that removes a bias inherent in standard length normalization.

Resources

- Chapters 6 and 7 of Introduction to Information Retrieval
- Resources at
- Gerard Salton (main proponent of the vector space model in the 70s, 80s, and 90s)
- "Exploring the similarity space" (Moffat and Zobel, 2005)
- Pivot normalization (original paper)


More information

Information Retrieval. Lecture 8 - Relevance feedback and query expansion. Introduction. Overview. About Relevance Feedback. Wintersemester 2007

Information Retrieval. Lecture 8 - Relevance feedback and query expansion. Introduction. Overview. About Relevance Feedback. Wintersemester 2007 Information Retrieval Lecture 8 - Relevance feedback and query expansion Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 32 Introduction An information

More information

A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis

A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis Yusuf Yaslan and Zehra Cataltepe Istanbul Technical University, Computer Engineering Department, Maslak 34469 Istanbul, Turkey

More information

1 VECTOR SPACES AND SUBSPACES

1 VECTOR SPACES AND SUBSPACES 1 VECTOR SPACES AND SUBSPACES What is a vector? Many are familiar with the concept of a vector as: Something which has magnitude and direction. an ordered pair or triple. a description for quantities such

More information

Dimensionality Reduction: Principal Components Analysis

Dimensionality Reduction: Principal Components Analysis Dimensionality Reduction: Principal Components Analysis In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely

More information

Linear Algebra Notes for Marsden and Tromba Vector Calculus

Linear Algebra Notes for Marsden and Tromba Vector Calculus Linear Algebra Notes for Marsden and Tromba Vector Calculus n-dimensional Euclidean Space and Matrices Definition of n space As was learned in Math b, a point in Euclidean three space can be thought of

More information

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 9 9/20/2011 Today 9/20 Where we are MapReduce/Hadoop Probabilistic IR Language models LM for ad hoc retrieval 1 Where we are... Basics of ad

More information

Inverted Indexes: Trading Precision for Efficiency

Inverted Indexes: Trading Precision for Efficiency Inverted Indexes: Trading Precision for Efficiency Yufei Tao KAIST April 1, 2013 After compression, an inverted index is often small enough to fit in memory. This benefits query processing because it avoids

More information

An ontology-based approach for semantic ranking of the web search engines results

An ontology-based approach for semantic ranking of the web search engines results An ontology-based approach for semantic ranking of the web search engines results Editor(s): Name Surname, University, Country Solicited review(s): Name Surname, University, Country Open review(s): Name

More information

Requirements Traceability Recovery

Requirements Traceability Recovery MASTER S THESIS Requirements Traceability Recovery - A Study of Available Tools - Author: Lina Brodén Supervisor: Markus Borg Examiner: Prof. Per Runeson April 2011 Abstract This master s thesis is focused

More information

Mining a Corpus of Job Ads

Mining a Corpus of Job Ads Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department

More information

Development of an Enhanced Web-based Automatic Customer Service System

Development of an Enhanced Web-based Automatic Customer Service System Development of an Enhanced Web-based Automatic Customer Service System Ji-Wei Wu, Chih-Chang Chang Wei and Judy C.R. Tseng Department of Computer Science and Information Engineering Chung Hua University

More information

Adding vectors We can do arithmetic with vectors. We ll start with vector addition and related operations. Suppose you have two vectors

Adding vectors We can do arithmetic with vectors. We ll start with vector addition and related operations. Suppose you have two vectors 1 Chapter 13. VECTORS IN THREE DIMENSIONAL SPACE Let s begin with some names and notation for things: R is the set (collection) of real numbers. We write x R to mean that x is a real number. A real number

More information

Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

More information

Common factor analysis

Common factor analysis Common factor analysis This is what people generally mean when they say "factor analysis" This family of techniques uses an estimate of common variance among the original variables to generate the factor

More information

Interactive Information Retrieval Using Clustering and Spatial Proximity

Interactive Information Retrieval Using Clustering and Spatial Proximity User Modeling and User-Adapted Interaction 14: 259^288, 2004. 259 # 2004 Kluwer Academic Publishers. Printed in the Netherlands. Interactive Information Retrieval Using Clustering and Spatial Proximity

More information

Social Business Intelligence Text Search System

Social Business Intelligence Text Search System Social Business Intelligence Text Search System Sagar Ligade ME Computer Engineering. Pune Institute of Computer Technology Pune, India ABSTRACT Today the search engine plays the important role in the

More information

A survey on the use of relevance feedback for information access systems

A survey on the use of relevance feedback for information access systems A survey on the use of relevance feedback for information access systems Ian Ruthven Department of Computer and Information Sciences University of Strathclyde, Glasgow, G1 1XH. Ian.Ruthven@cis.strath.ac.uk

More information

MATH 551 - APPLIED MATRIX THEORY

MATH 551 - APPLIED MATRIX THEORY MATH 55 - APPLIED MATRIX THEORY FINAL TEST: SAMPLE with SOLUTIONS (25 points NAME: PROBLEM (3 points A web of 5 pages is described by a directed graph whose matrix is given by A Do the following ( points

More information

American Journal of Engineering Research (AJER) 2013 American Journal of Engineering Research (AJER) e-issn: 2320-0847 p-issn : 2320-0936 Volume-2, Issue-4, pp-39-43 www.ajer.us Research Paper Open Access

More information

Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

More information

Equations Involving Lines and Planes Standard equations for lines in space

Equations Involving Lines and Planes Standard equations for lines in space Equations Involving Lines and Planes In this section we will collect various important formulas regarding equations of lines and planes in three dimensional space Reminder regarding notation: any quantity

More information

Using Words and Phonetic Strings for Efficient Information Retrieval from Imperfectly Transcribed Spoken Documents

Using Words and Phonetic Strings for Efficient Information Retrieval from Imperfectly Transcribed Spoken Documents Using Words and Phonetic Strings for Efficient Information Retrieval from Imperfectly Transcribed Spoken Documents Michael J. Witbrock and Alexander G. Hauptmann Carnegie Mellon University ABSTRACT Library

More information

Linear Algebra Review. Vectors

Linear Algebra Review. Vectors Linear Algebra Review By Tim K. Marks UCSD Borrows heavily from: Jana Kosecka kosecka@cs.gmu.edu http://cs.gmu.edu/~kosecka/cs682.html Virginia de Sa Cogsci 8F Linear Algebra review UCSD Vectors The length

More information

Information Retrieval on the Internet

Information Retrieval on the Internet Information Retrieval on the Internet Diana Inkpen Professor, University of Ottawa, 800 King Edward, Ottawa, ON, Canada, K1N 6N5 Tel. 1-613-562-5800 ext 6711, fax 1-613-562-5175 diana@site.uottawa.ca Outline

More information

CHAPTER VII CONCLUSIONS

CHAPTER VII CONCLUSIONS CHAPTER VII CONCLUSIONS To do successful research, you don t need to know everything, you just need to know of one thing that isn t known. -Arthur Schawlow In this chapter, we provide the summery of the

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Instructor s Solution Manual Pang-Ning Tan Michael Steinbach Vipin Kumar Copyright c 2006 Pearson Addison-Wesley. All rights reserved. Contents 1 Introduction 1 2 Data 5 3

More information

Manifold Learning Examples PCA, LLE and ISOMAP

Manifold Learning Examples PCA, LLE and ISOMAP Manifold Learning Examples PCA, LLE and ISOMAP Dan Ventura October 14, 28 Abstract We try to give a helpful concrete example that demonstrates how to use PCA, LLE and Isomap, attempts to provide some intuition

More information

2x + y = 3. Since the second equation is precisely the same as the first equation, it is enough to find x and y satisfying the system

2x + y = 3. Since the second equation is precisely the same as the first equation, it is enough to find x and y satisfying the system 1. Systems of linear equations We are interested in the solutions to systems of linear equations. A linear equation is of the form 3x 5y + 2z + w = 3. The key thing is that we don t multiply the variables

More information

Solving Systems of Linear Equations

Solving Systems of Linear Equations LECTURE 5 Solving Systems of Linear Equations Recall that we introduced the notion of matrices as a way of standardizing the expression of systems of linear equations In today s lecture I shall show how

More information

An Analysis of Factors Used in Search Engine Ranking

An Analysis of Factors Used in Search Engine Ranking An Analysis of Factors Used in Search Engine Ranking Albert Bifet 1 Carlos Castillo 2 Paul-Alexandru Chirita 3 Ingmar Weber 4 1 Technical University of Catalonia 2 University of Chile 3 L3S Research Center

More information

Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND II. PROBLEM AND SOLUTION

Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND II. PROBLEM AND SOLUTION Brian Lao - bjlao Karthik Jagadeesh - kjag Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND There is a large need for improved access to legal help. For example,

More information

3-Step Competency Prioritization Sequence

3-Step Competency Prioritization Sequence 3-Step Competency Prioritization Sequence The Core Competencies for Public Health Professionals (Core Competencies), a consensus set of competencies developed by the Council on Linkages Between Academia

More information

A SYSTEM FOR CROWD ORIENTED EVENT DETECTION, TRACKING, AND SUMMARIZATION IN SOCIAL MEDIA

A SYSTEM FOR CROWD ORIENTED EVENT DETECTION, TRACKING, AND SUMMARIZATION IN SOCIAL MEDIA A SYSTEM FOR CROWD ORIENTED EVENT DETECTION, TRACKING, AND SUMMARIZATION IN SOCIAL MEDIA An Undergraduate Research Scholars Thesis by JASON M. BOLDEN Submitted to Honors and Undergraduate Research Texas

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28 Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object

More information

Extra Credit Assignment Lesson plan. The following assignment is optional and can be completed to receive up to 5 points on a previously taken exam.

Extra Credit Assignment Lesson plan. The following assignment is optional and can be completed to receive up to 5 points on a previously taken exam. Extra Credit Assignment Lesson plan The following assignment is optional and can be completed to receive up to 5 points on a previously taken exam. The extra credit assignment is to create a typed up lesson

More information