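Islamic University of Gaza
Faculty of Engineering
Computer Engineering Department
Information Storage and Retrieval (ECOM 5124)
IR HW 5+6: Scoring, term weighting and the vector space model
Eng. Mohammed Abdualal

Exercise 6.2
In Example 6.1 above with weights g1 = 0.2, g2 = 0.31 and g3 = 0.49, what are all the distinct score values a document may get?

The weights are 0.2 for the author zone, 0.31 for the title zone and 0.49 for the body zone. The score is the sum of the weights of the zones in which the query term appears, so the distinct values are:
0     if it appears in no zone
0.2   if it appears in the author zone only
0.31  if it appears in the title zone only
0.49  if it appears in the body zone only
0.51  if it appears in the author and title zones but not the body
0.69  if it appears in the author and body zones but not the title
0.8   if it appears in the title and body zones but not the author
1     if it appears in all zones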
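The enumeration in Exercise 6.2 can be checked mechanically. The following is a minimal sketch, assuming that a document's score is simply the sum of the weights of the zones that match; the function name `distinct_scores` is illustrative and does not come from the textbook.

```python
from itertools import combinations

# Zone weights from Exercise 6.2.
weights = {"author": 0.2, "title": 0.31, "body": 0.49}

def distinct_scores(weights):
    """Map every achievable weighted zone score to the zones that produce it."""
    zones = list(weights)
    scores = {}
    for size in range(len(zones) + 1):
        for subset in combinations(zones, size):
            # Round to two decimals to hide floating-point noise in the sums.
            total = round(sum(weights[z] for z in subset), 2)
            scores[total] = subset
    return dict(sorted(scores.items()))

scores = distinct_scores(weights)
for score, zones in scores.items():
    print(f"{score:.2f}  {', '.join(zones) or 'no zone'}")
```

Because no two subsets of these three weights share the same sum, the sketch yields one score per subset, eight in total.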
Exercise 6.5
Apply Equation 6.6 to the sample training set in Figure 6.5 to estimate the best value of g for this sample.

The optimal g is
g* = (n_10r + n_01n) / (n_10r + n_10n + n_01r + n_01n)
where n_01r is the number of training examples labeled relevant for which s_T(d_j, q_j) = 0 and s_B(d_j, q_j) = 1, i.e., the term occurs in the body but not the title, and n_10n is the number of examples labeled nonrelevant with the term occurring in the title but not the body.

n_10r = 0
n_10n = 1
n_01r = 2
n_01n = 1
g* = 1 / 4 = 0.25

Exercise 6.6
For the value of g estimated in Exercise 6.5, compute the weighted zone score for each (query, document) example. How do these scores relate to the relevance judgments in Figure 6.5 (quantized to 0/1)?

Example  Score  Relevance judgment
----------------------------------
Φ1       1      R
Φ2       3/4    NR
Φ3       3/4    R
Φ4       0      NR
Φ5       1      R
Φ6       3/4    R
Φ7       1/4    NR

Nonrelevant examples tend to get lower scores (3/4, 0, 1/4) than relevant ones (1, 3/4, 1, 3/4). If the scores are quantized at 0.5, they agree with the judgments on every example except Φ2, which scores 3/4 but is judged nonrelevant.
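The counts and scores in Exercises 6.5 and 6.6 can be reproduced with a short script. This is a sketch under one assumption: the (s_T, s_B) pair of each example is inferred from its score in the table above (1 means both zones match, 3/4 body only, 1/4 title only, 0 neither), since Figure 6.5 itself is not reproduced in this document. The helper names are illustrative.

```python
from fractions import Fraction

# (s_T, s_B, relevant) for examples 1..7.
# ASSUMPTION: the zone matches are inferred from the scores in Exercise 6.6.
examples = [
    (1, 1, True),
    (0, 1, False),
    (0, 1, True),
    (0, 0, False),
    (1, 1, True),
    (0, 1, True),
    (1, 0, False),
]

def count(examples, s_t, s_b, relevant):
    """Number of examples with the given zone matches and judgment."""
    return sum(1 for t, b, r in examples if (t, b, r) == (s_t, s_b, relevant))

def optimal_g(examples):
    """Equation 6.6: g* = (n_10r + n_01n) / (n_10r + n_10n + n_01r + n_01n)."""
    n10r = count(examples, 1, 0, True)
    n10n = count(examples, 1, 0, False)
    n01r = count(examples, 0, 1, True)
    n01n = count(examples, 0, 1, False)
    return Fraction(n10r + n01n, n10r + n10n + n01r + n01n)

def zone_score(g, s_t, s_b):
    """Weighted zone score g * s_T + (1 - g) * s_B."""
    return g * s_t + (1 - g) * s_b

g = optimal_g(examples)
scores = [zone_score(g, t, b) for t, b, _ in examples]
print("g* =", g)
for i, (score, (_, _, relevant)) in enumerate(zip(scores, examples), start=1):
    print(i, score, "R" if relevant else "NR")
```

Using `Fraction` keeps the scores exact, so they can be compared directly with the 3/4 and 1/4 entries in the table.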
Exercise 6.7
Why does the expression for g in (6.6) not involve training examples in which s_T(d_t, q_t) and s_B(d_t, q_t) have the same value?

score(d, q) = g * s_T(d, q) + (1 - g) * s_B(d, q)

When s_T and s_B have the same value, the score does not depend on g: if both are 0 the score is 0, and if both are 1 the score is g + (1 - g) = 1. The error that these examples contribute under Equation 6.4 is therefore a constant, which vanishes when the total error is differentiated with respect to g, so they do not appear in the expression for the optimal g.

Exercise 6.9
What is the idf of a term that occurs in every document? Compare this with the use of stop word lists.

The idf is zero: idf = log(N/N) = log 1 = 0. A term that occurs in every document therefore contributes nothing to the score, which has the same effect as putting it on a stop word list: like a stop word, it occurs everywhere and says nothing about any particular document.

Exercise 6.10
Consider the table of term frequencies for 3 documents denoted Doc1, Doc2, Doc3 in Figure 6.9. Compute the tf-idf weights for the terms car, auto, insurance, best, for each document, using the idf values from Figure 6.8.

tf-idf = tf * idf

Term       Doc1   Doc2   Doc3
-----------------------------
car        44.55  6.6    39.6
auto       6.24   68.64  0
insurance  0      53.46  46.98
best       21     0      25.5
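The tf-idf table in Exercise 6.10 is a term-by-term product, which the sketch below reproduces. The tf and idf values are assumptions: Figures 6.8 and 6.9 are not reproduced in this document, so the numbers here are recalled from the textbook and chosen to be consistent with the table above.

```python
# ASSUMPTION: term frequencies as in Figure 6.9 (Doc1, Doc2, Doc3).
tf = {
    "car": [27, 4, 24],
    "auto": [3, 33, 0],
    "insurance": [0, 33, 29],
    "best": [14, 0, 17],
}
# ASSUMPTION: idf values as in Figure 6.8.
idf = {"car": 1.65, "auto": 2.08, "insurance": 1.62, "best": 1.5}

# tf-idf = tf * idf, rounded to two decimals for display.
tfidf = {
    term: [round(freq * idf[term], 2) for freq in freqs]
    for term, freqs in tf.items()
}

for term, row in tfidf.items():
    print(f"{term:<10}", *row)
```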
Exercise 6.11
Can the tf-idf weight of a term in a document exceed 1?

Yes. For example, the tf-idf weight of car in Doc1 in Exercise 6.10 is 44.55.

Exercise 6.15
Recall the tf-idf weights computed in Exercise 6.10. Compute the Euclidean normalized document vectors for each of the documents, where each vector has four components, one for each of the four terms.

        car    auto   ins.   best
----------------------------------
Doc1 = (0.897, 0.125, 0.000, 0.423)
Doc2 = (0.076, 0.786, 0.613, 0.000)
Doc3 = (0.595, 0.000, 0.706, 0.383)

Exercise 6.16
Verify that the sum of the squares of the components of each of the document vectors in Exercise 6.15 is 1 (to within rounding error). Why is this the case?

Doc1: 0.897^2 + 0.125^2 + 0.423^2 = 0.999
Doc2: 0.076^2 + 0.786^2 + 0.613^2 = 0.999
Doc3: 0.595^2 + 0.706^2 + 0.383^2 = 0.999

This holds because the vectors are Euclidean normalized: every component is divided by the vector's length, the square root of the sum of the squared components, so the squares of the normalized components sum to 1. In other words, they are unit vectors.
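The normalization in Exercise 6.15 and the check in Exercise 6.16 can be sketched as follows, starting from the tf-idf weights of Exercise 6.10. The function name `normalize` is illustrative.

```python
import math

# tf-idf weights from Exercise 6.10, in the order car, auto, insurance, best.
tfidf = {
    "Doc1": [44.55, 6.24, 0.0, 21.0],
    "Doc2": [6.6, 68.64, 53.46, 0.0],
    "Doc3": [39.6, 0.0, 46.98, 25.5],
}

def normalize(vec):
    """Divide each component by the Euclidean length of the vector."""
    length = math.sqrt(sum(w * w for w in vec))
    return [w / length for w in vec]

unit = {doc: normalize(vec) for doc, vec in tfidf.items()}

for doc, vec in unit.items():
    sum_of_squares = sum(w * w for w in vec)
    print(doc, [round(w, 3) for w in vec], round(sum_of_squares, 6))
```

At full precision the sum of squares is 1 up to floating-point error. The printed components may differ from the table in the last digit, since the table values appear to be truncated rather than rounded.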
Exercise 6.17
With term weights as computed in Exercise 6.15, rank the three documents by computed score for the query car insurance, for each of the following cases of term weighting in the query:
1. The weight of a term is 1 if present in the query, 0 otherwise.
2. Euclidean normalized idf.

1. Let p = q * d be the component-wise product of the query and document vectors. Then score(q, d) = p_t1 + p_t2 + p_t3 + p_t4.
2.
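Both cases of Exercise 6.17 can be worked through numerically with the document vectors from Exercise 6.15. This is only a sketch: the idf values are an assumption recalled from Figure 6.8 of the textbook (they are not reproduced in this document), and "Euclidean normalized idf" is read here as the idf of the query terms divided by the length of the query's idf vector.

```python
import math

terms = ["car", "auto", "insurance", "best"]

# Normalized document vectors from Exercise 6.15.
docs = {
    "Doc1": [0.897, 0.125, 0.000, 0.423],
    "Doc2": [0.076, 0.786, 0.613, 0.000],
    "Doc3": [0.595, 0.000, 0.706, 0.383],
}

# ASSUMPTION: idf values as in Figure 6.8.
idf = {"car": 1.65, "auto": 2.08, "insurance": 1.62, "best": 1.5}
query_terms = {"car", "insurance"}

def rank(query_vec):
    """Score each document by the dot product with the query, best first."""
    scores = {
        doc: sum(q * d for q, d in zip(query_vec, vec))
        for doc, vec in docs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Case 1: weight 1 for terms present in the query, 0 otherwise.
binary_q = [1.0 if t in query_terms else 0.0 for t in terms]

# Case 2: idf of the query terms, divided by the Euclidean length.
raw_idf_q = [idf[t] if t in query_terms else 0.0 for t in terms]
length = math.sqrt(sum(w * w for w in raw_idf_q))
idf_q = [w / length for w in raw_idf_q]

ranking_binary = rank(binary_q)
ranking_idf = rank(idf_q)
for name, ranking in [("binary", ranking_binary), ("idf", ranking_idf)]:
    print(name, [(doc, round(score, 3)) for doc, score in ranking])
```

Under these assumptions both weightings give the same ranking, Doc3 first, then Doc1, then Doc2, because Doc3 is the only document with large weights on both query terms.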
Exercise 6.19
Compute the vector space similarity between the query "digital cameras" and the document "digital cameras and video cameras" by filling out the empty columns in Table 6.1. Assume N = 10,000,000, logarithmic term weighting (wf columns) for query and document, idf weighting for the query only and cosine normalization for the document only. Treat "and" as a stop word. Enter term counts in the tf columns. What is the final similarity score?
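The computation that Exercise 6.19 asks for can be sketched step by step. The document frequencies below are an assumption: Table 6.1 is not reproduced in this document, so the df values are recalled from the textbook and should be checked against the table before relying on the result. Logarithms are taken to base 10.

```python
import math

N = 10_000_000

# ASSUMPTION: document frequencies as given in Table 6.1.
df = {"digital": 10_000, "video": 100_000, "cameras": 50_000}

query = "digital cameras".split()
# "and" is treated as a stop word and dropped.
document = [w for w in "digital cameras and video cameras".split() if w != "and"]

def wf(tf):
    """Logarithmic term weighting: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

terms = sorted(df)

# Query: wf times idf, no normalization.
q_weights = {t: wf(query.count(t)) * math.log10(N / df[t]) for t in terms}

# Document: wf only, then cosine normalization.
d_wf = {t: wf(document.count(t)) for t in terms}
d_length = math.sqrt(sum(w * w for w in d_wf.values()))
d_weights = {t: w / d_length for t, w in d_wf.items()}

score = sum(q_weights[t] * d_weights[t] for t in terms)
print(round(score, 2))
```

With the assumed df values the similarity comes out at about 3.12. A different df column would change the idf weights and therefore the score.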
Exercise 6.20
Show that for the query affection, the relative ordering of the scores of the three documents in Figure 6.13 is the reverse of the ordering of the scores for the query jealous gossip.

For the query affection, score(q, SaS) = 0.996, score(q, PaP) = 0.993 and score(q, WH) = 0.847, so the order is SaS, PaP, WH.
For the query jealous gossip (summing the weights of the two terms), score(q, SaS) = 0.104, score(q, PaP) = 0.120 and score(q, WH) = 0.720, so the order is WH, PaP, SaS.
The second ordering is therefore the reverse of the first.

Exercise 6.23
Refer to the tf and idf values for four terms and three documents in Exercise 6.10. Compute the two top scoring documents on the query best car insurance for each of the following weighting schemes: (i) nnn.atc; (ii) ntc.atc.

(i)
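One possible way to work through both schemes of Exercise 6.23 is sketched below. It rests on several assumptions that should be checked against the textbook: the tf and idf values are recalled from Figures 6.9 and 6.8 (they are not reproduced in this document), the notation ddd.qqq is read as document weighting followed by query weighting, and a term absent from the query is given weight 0 rather than the 0.5 that the augmented formula alone would yield. The function names are illustrative.

```python
import math

# Term order: car, auto, insurance, best.
# ASSUMPTION: tf values as in Figure 6.9, idf values as in Figure 6.8.
tf = {
    "Doc1": [27, 3, 0, 14],
    "Doc2": [4, 33, 33, 0],
    "Doc3": [24, 0, 29, 17],
}
idf = [1.65, 2.08, 1.62, 1.5]
query_tf = [1, 0, 1, 1]  # query "best car insurance"

def cosine_normalize(vec):
    """Divide each component by the Euclidean length of the vector."""
    length = math.sqrt(sum(w * w for w in vec))
    return [w / length for w in vec]

def atc(tf_vec):
    """Augmented tf, idf, cosine normalization (absent terms weigh 0)."""
    max_tf = max(tf_vec)
    augmented = [0.5 + 0.5 * f / max_tf if f > 0 else 0.0 for f in tf_vec]
    return cosine_normalize([a * i for a, i in zip(augmented, idf)])

def nnn(tf_vec):
    """Natural tf, no idf, no normalization."""
    return [float(f) for f in tf_vec]

def ntc(tf_vec):
    """Natural tf, idf, cosine normalization."""
    return cosine_normalize([f * i for f, i in zip(tf_vec, idf)])

def top_two(doc_scheme):
    """Score documents against the atc-weighted query and keep the best two."""
    q = atc(query_tf)
    scores = {
        doc: sum(d * w for d, w in zip(doc_scheme(vec), q))
        for doc, vec in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:2]

print("nnn.atc:", top_two(nnn))
print("ntc.atc:", top_two(ntc))
```

Under these assumptions both schemes place Doc3 first and Doc1 second. A different convention for absent query terms or for the order of the two triplets could change the scores, so the result is a cross-check rather than a definitive answer.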