ISCL wintersemester 2007 IR Midterm exam

17 December 2007
SOLUTIONS

Non-electronic documents and calculators are authorized.

Name :
Semester :

Exercise 1 : Definitions

Define the following terms :

tokenization
  segmentation of a document in order to produce a list of items (deals with punctuation, acronyms, dates, etc.)

permuterm index
  index mapping all rotations (cyclic permutations of the characters, including the end-of-word delimiter) of a given word to this word (used for wildcard queries)

champion list
  pre-computed list of the r most relevant documents with respect to a given term

Exercise 2 : Characteristics of a collection and its index

Consider a collection made of 500 000 documents, each containing on average 800 words. The number of different words (i.e. not taking duplicates into account) is estimated at 700 000. For all questions, give your computation.

What is the size (mega or giga bytes) of the collection when stored (uncompressed) on disc?
  500 000 × 800 × 6 = 2.4 × 10^9 bytes = 2.4 GB

With the best reduction rate of the dictionary achieved when using a linguistic preprocessing (noise words, stemming), what is the size (number of terms) of the dictionary?
  best reduction rate : 50 %
  700 000 / 2 = 350 000 keywords

Consider an index where the average length of a non-positional posting list is 200. What is the estimation of the total number of postings of this index?
  350 000 × 200 = 70 000 000 postings

How many bytes do you allow respectively for encoding (without compression) a dictionary term? a non-positional posting?
  Dictionary term : 40 bytes for the keyword (unicode charset, 2 bytes per char, and maximum 20 characters per word), 4 bytes for the keyword frequency, and 3 bytes for the pointer to the posting list (log2(350 000) ≈ 19 bits), i.e. 40 + 4 + 3 = 47 bytes
  Posting : 500 000 documents to refer to, log2(500 000) ≈ 19 bits, hence 3 bytes
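The estimates of this exercise can be cross-checked with a few lines of Python. This is only a sketch under stated assumptions: the figures 500 000 and 350 000 appear in the solution itself, whereas the 6 bytes per word and the 700 000 distinct words are values inferred from the printed results (2.4 GB and the 50 % reduction), and all variable names are illustrative.

```python
import math

# Figures from the exercise. bytes_per_word and distinct_words are inferred
# from the printed answers (assumption), not stated explicitly in the source.
n_docs = 500_000
words_per_doc = 800
bytes_per_word = 6
distinct_words = 700_000
avg_posting_list_length = 200


def bytes_needed(n_values):
    """Whole bytes required to address n_values distinct items."""
    bits = math.ceil(math.log2(n_values))
    return math.ceil(bits / 8)


# Size of the uncompressed collection.
collection_bytes = n_docs * words_per_doc * bytes_per_word

# Dictionary after a 50 % reduction, and total number of postings.
dictionary_terms = distinct_words // 2
total_postings = dictionary_terms * avg_posting_list_length

# Fixed-width dictionary entry: 20 chars x 2 bytes + frequency + pointer.
term_bytes = 20 * 2 + 4 + bytes_needed(dictionary_terms)
posting_bytes = bytes_needed(n_docs)

dictionary_bytes = dictionary_terms * term_bytes
postings_bytes = total_postings * posting_bytes

# Dictionary-as-a-string: frequency + postings pointer + string pointer
# + 8 characters of 2 bytes each on average.
compressed_dictionary_bytes = dictionary_terms * (4 + 3 + 3 + 8 * 2)

print(collection_bytes, dictionary_bytes, postings_bytes,
      compressed_dictionary_bytes)
```

Rounding the bit counts up to whole bytes is what turns the 19-bit pointers into 3-byte fields.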

What are the sizes (mega or giga bytes) of the resulting dictionary and posting lists?
  Dictionary : 350 000 × 47 = 16 450 000 bytes = 16.45 MB
  Postings : 70 000 000 × 3 = 210 000 000 bytes = 210 MB

If you compress your dictionary using the dictionary-as-a-string method, what is the new size of the dictionary?
  350 000 × (4 + 3 + 3 + 16) = 9 100 000 bytes = 9.1 MB
  (4 bytes for the term frequency, 3 bytes for the pointer to the posting list, 3 bytes for the pointer into the string, and 8 characters per word on average, each encoded with 2 bytes)

Exercise 3 : Querying an index

What kinds of queries can be applied to the collection, and for each of these, what index is needed?
  boolean queries : non-positional index
  phrase queries : positional index
  wildcard queries : permuterm index or n-gram index
  similarity queries : frequency index

Exercise 4 : Linguistic preprocessing

Are the following statements true or false (justify your answer)?

a) stemming increases retrieval precision.
  false. Stemming decreases precision: since the inflection of words is ignored, many documents are retrieved even if they do not relate to the query (ex. Golden retriever vs. Gold retrieval).

b) stemming only slightly reduces the size of the dictionary.
  false. Stemming can in some cases reduce the size of the dictionary by 33 to 50 %.

c) stop lists contain all the most frequent terms.
  false. Stop lists contain some of the most frequent terms (a counter-example is the word water for English, which is among the most frequent but not included in stop lists).

Exercise 5 : Porter stemming

What would be the result of the Porter stemmer used with the following words?
  busses : buss
  rely : reli

  realised : realis

What is the Porter measure of the following words (give your computation)?
  crepuscular : cr ep usc ul ar = C VC VC VC VC, m = 4
  rigorous : r ig or ous = C VC VC VC, m = 3
  placement : pl ac em ent = C VC VC VC, m = 3

Exercise 6 : Index architecture

Propose a Map-Reduce architecture for creating language-specific indexes from a heterogeneous collection. You can illustrate this architecture using a figure.

Exercise 7 : Index compression

What is the largest gap that can be encoded in 2 bytes using the variable-byte encoding?
  With 2 bytes, we use 2 continuation bits, and 14 bits are available for gap encoding (2^0 to 2^13). Hence, the largest gap that can be encoded is 2^14 - 1 = 16383 (when all 14 bits are set to 1).

What is the posting list that can be decoded from the variable byte-code ?
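A minimal sketch of variable-byte encoding and decoding, written to check the 2-byte limit above. It assumes the common textbook convention in which the high bit of a byte is set to 1 on the last byte of each gap; the exam does not say which convention was used in the lecture, and the function names are illustrative.

```python
def vb_encode_number(n):
    """Encode one non-negative gap as variable-byte code.

    Assumed convention: 7 payload bits per byte, and the high
    (continuation) bit is set to 1 only on the last byte of the gap.
    """
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.insert(0, n % 128)
        n //= 128
    chunks[-1] += 128  # mark the terminating byte
    return bytes(chunks)


def vb_decode(stream):
    """Decode a byte stream into the list of gaps it contains."""
    gaps, n = [], 0
    for byte in stream:
        if byte < 128:          # more bytes follow for this gap
            n = 128 * n + byte
        else:                   # last byte of the current gap
            gaps.append(128 * n + (byte - 128))
            n = 0
    return gaps


def gaps_to_doc_ids(gaps):
    """Turn a list of gaps back into absolute document identifiers."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


largest_two_byte_gap = 2 ** 14 - 1
print(len(vb_encode_number(largest_two_byte_gap)))      # bytes used at the limit
print(len(vb_encode_number(largest_two_byte_gap + 1)))  # bytes used just above it
```

Decoding a posting list is then a matter of calling `vb_decode` on the byte stream and passing the resulting gaps to `gaps_to_doc_ids`.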

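What would be the encoding of the same posting list using a γ-code?
  , , ,

Exercise 8 : Vector Space Model

Consider a collection made of the documents d_1, d_2, d_3 and whose characteristics are the following :

  Term      tf_d1   tf_d2   tf_d3   df
  actor     12      35      53
  movie     15      24      48
  trailer   52      13      12

Compute the vector representations of d_1, d_2 and d_3 using the tf-idf_{t,d} weighting and the euclidian normalisation.

Estimation of the collection size : either you define your own (symbolic or not) collection size, or you use a heuristic such as "the 3 keywords appear only together in d_1, d_2 and d_3". With the latter, the collection size is N = ( ) = 445.

Writing df_t for the document frequency of term t given in the table :

\[
\vec{v}(d_1) = \frac{1}{D_1}
\begin{pmatrix}
12 \log_{10}(445/\mathrm{df}_{actor}) \\
15 \log_{10}(445/\mathrm{df}_{movie}) \\
52 \log_{10}(445/\mathrm{df}_{trailer})
\end{pmatrix}
\quad
\vec{v}(d_2) = \frac{1}{D_2}
\begin{pmatrix}
35 \log_{10}(445/\mathrm{df}_{actor}) \\
24 \log_{10}(445/\mathrm{df}_{movie}) \\
13 \log_{10}(445/\mathrm{df}_{trailer})
\end{pmatrix}
\quad
\vec{v}(d_3) = \frac{1}{D_3}
\begin{pmatrix}
53 \log_{10}(445/\mathrm{df}_{actor}) \\
48 \log_{10}(445/\mathrm{df}_{movie}) \\
12 \log_{10}(445/\mathrm{df}_{trailer})
\end{pmatrix}
\]

where

\[
D_1 = \sqrt{(12 \log_{10}(445/\mathrm{df}_{actor}))^2 + (15 \log_{10}(445/\mathrm{df}_{movie}))^2 + (52 \log_{10}(445/\mathrm{df}_{trailer}))^2}
\]
\[
D_2 = \sqrt{(35 \log_{10}(445/\mathrm{df}_{actor}))^2 + (24 \log_{10}(445/\mathrm{df}_{movie}))^2 + (13 \log_{10}(445/\mathrm{df}_{trailer}))^2}
\]
\[
D_3 = \sqrt{(53 \log_{10}(445/\mathrm{df}_{actor}))^2 + (48 \log_{10}(445/\mathrm{df}_{movie}))^2 + (12 \log_{10}(445/\mathrm{df}_{trailer}))^2}
\]

Compute the cosine similarities between these documents.

Since the vectors are normalised, the cosine similarity is their dot product :

\[
s(d_1,d_2) = \vec{v}(d_1) \cdot \vec{v}(d_2) = \frac{12 \cdot 35 \log_{10}^2(445/\mathrm{df}_{actor}) + 15 \cdot 24 \log_{10}^2(445/\mathrm{df}_{movie}) + 52 \cdot 13 \log_{10}^2(445/\mathrm{df}_{trailer})}{D_1 D_2}
\]
\[
s(d_1,d_3) = \vec{v}(d_1) \cdot \vec{v}(d_3) = \frac{12 \cdot 53 \log_{10}^2(445/\mathrm{df}_{actor}) + 15 \cdot 48 \log_{10}^2(445/\mathrm{df}_{movie}) + 52 \cdot 12 \log_{10}^2(445/\mathrm{df}_{trailer})}{D_1 D_3}
\]
\[
s(d_2,d_3) = \vec{v}(d_2) \cdot \vec{v}(d_3) = \frac{35 \cdot 53 \log_{10}^2(445/\mathrm{df}_{actor}) + 24 \cdot 48 \log_{10}^2(445/\mathrm{df}_{movie}) + 13 \cdot 12 \log_{10}^2(445/\mathrm{df}_{trailer})}{D_2 D_3}
\]

Give the ranking retrieved by the system for the query movie trailer.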
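The computations of this exercise can be sketched in Python as follows. The term frequencies are those appearing in the solution vectors (taking 53 for actor in d_3), but the document frequencies did not survive in this copy, so the `df` values below are hypothetical placeholders chosen only so that they sum to 445. The binary query vector is likewise an assumption. The numbers printed by this sketch therefore illustrate the method, not the exam's expected result.

```python
import math

TERMS = ["actor", "movie", "trailer"]

N = 445
# Hypothetical placeholder document frequencies (assumption, not from the exam).
df = {"actor": 150, "movie": 200, "trailer": 95}

# Term frequencies as they appear in the solution vectors.
tf = {
    "d1": {"actor": 12, "movie": 15, "trailer": 52},
    "d2": {"actor": 35, "movie": 24, "trailer": 13},
    "d3": {"actor": 53, "movie": 48, "trailer": 12},
}


def tfidf_vector(doc_tf, doc_freq, n_docs):
    """tf * log10(N / df) weights, divided by their Euclidean norm."""
    weights = [doc_tf[t] * math.log10(n_docs / doc_freq[t]) for t in TERMS]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


def dot(u, v):
    """Dot product; equals the cosine when both vectors are normalised."""
    return sum(a * b for a, b in zip(u, v))


vectors = {name: tfidf_vector(tf[name], df, N) for name in tf}

# Pairwise cosine similarities between the documents.
cosines = {
    (a, b): dot(vectors[a], vectors[b])
    for a in vectors for b in vectors if a < b
}

# Query "movie trailer" as a binary vector over TERMS (assumption).
query = [0.0, 1.0, 1.0]
ranking = sorted(vectors, key=lambda name: dot(query, vectors[name]),
                 reverse=True)

print(cosines)
print(ranking)
```

Replacing the placeholder `df` values with the ones from the original table is the only change needed to reproduce the intended figures.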

We need the vector representation for the query q = "movie trailer". We can use the following : v(q) =

Then we can compute the score of each document and rank the documents by decreasing order of score :
  score(q,d_1) = v(q) . v(d_1)
  score(q,d_2) = v(q) . v(d_2)
  score(q,d_3) = v(q) . v(d_3)

Exercise 9 : Term weighting

Compute the vector representations of the documents introduced in the previous exercise using the ltn weighting scheme.

By ltn, we mean the following measure (cf. lecture 6) :
  tf_{t,d} : 1 + log10(tf_{t,d})
  idf_t : log10(N / df_t)
  normalisation : 1 (no normalisation)

Hence, we obtain :

\[
\vec{v}(d_1) =
\begin{pmatrix}
(1 + \log_{10} 12) \log_{10}(445/\mathrm{df}_{actor}) \\
(1 + \log_{10} 15) \log_{10}(445/\mathrm{df}_{movie}) \\
(1 + \log_{10} 52) \log_{10}(445/\mathrm{df}_{trailer})
\end{pmatrix}
\]

idem for v(d_2) and v(d_3).

Exercise 10 : Index architecture (extra credit)

Consider a hashtable as a structure mapping keys to values using a hash function h such that h(key) = value. What problem may arise from such a structure when inserting new key-value pairs?

For large collections of data, it may be hard (if not impossible) to guarantee that the hash function is injective. Indeed, two different keys may be associated with the same value. In other terms, the mapping encoded in the hashtable is likely to have to deal with keys having the same hash : x ≠ y and h(x) = h(y).

What workaround would you propose for this insertion? Give an algorithm for inserting a key-value pair.

A workaround for the insertion of key-value pairs whose hash-values are identical consists of using a primary mapping and a secondary mapping. The latter contains the redundant pairs (i.e. the pairs with identical hash-values), which are themselves linked to the main pair in the primary index. In this context, the insertion algorithm checks the slot for the pair to be inserted in the primary hashtable. If it is unset, the pair is stored there; otherwise the pair is stored at the end of the linked list of pairs in the secondary hashtable.

proc insert(key k, value v, hashtable H, hashfunction h)
  int i = h(k)
  if (H[i].isUnset()) then
    H[i].key = k
    H[i].value = v
    H[i].next = -1
  else
    int j = H[i].next
    int m = H.nextFree()
    int n = i
    while (j != -1)        // we traverse the linked list
      n = j
      j = H[j].next
    endwhile
    H[m].key = k           // we store the duplicate hash-value
    H[m].value = v         // in the first free slot
    H[m].next = -1
    H[n].next = m          // we link the previous end of the linked list
  endif
endproc
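The pseudocode above can be turned into a small runnable Python sketch. It keeps the idea of a `next` link per slot and of storing a colliding pair in the first free slot. The class name `ChainedTable`, the fixed table size, the use of Python's built-in `hash` modulo the size as h, the update of an already present key and the `lookup` method are all additions made for illustration, not part of the exam solution.

```python
class ChainedTable:
    """Fixed-size table whose colliding pairs are chained through 'next' links."""

    def __init__(self, size):
        self.size = size
        self.keys = [None] * size
        self.values = [None] * size
        self.next = [-1] * size       # index of the next pair in the chain
        self.used = [False] * size    # True once a slot holds a pair

    def _next_free(self):
        """Return the index of the first unset slot (H.nextFree() above)."""
        for i in range(self.size):
            if not self.used[i]:
                return i
        raise RuntimeError("table is full")

    def _store(self, i, key, value):
        self.keys[i], self.values[i] = key, value
        self.next[i] = -1
        self.used[i] = True

    def insert(self, key, value):
        i = hash(key) % self.size     # assumed hash function h
        if not self.used[i]:
            self._store(i, key, value)
            return
        # Walk to the end of the chain starting at slot i.
        n = i
        while True:
            if self.keys[n] == key:   # addition: overwrite an existing key
                self.values[n] = value
                return
            if self.next[n] == -1:
                break
            n = self.next[n]
        m = self._next_free()
        self._store(m, key, value)    # store the pair in the first free slot
        self.next[n] = m              # link it to the previous end of the chain

    def lookup(self, key):
        i = hash(key) % self.size
        while i != -1 and self.used[i]:
            if self.keys[i] == key:
                return self.values[i]
            i = self.next[i]
        raise KeyError(key)


table = ChainedTable(4)
table.insert(1, "a")
table.insert(5, "b")   # 5 % 4 == 1, collides with key 1
table.insert(0, "c")   # slot 0 is already taken by the overflow pair
print(table.lookup(5), table.lookup(0))
```

Because overflow pairs occupy ordinary slots, chains belonging to different hash values can merge, which is why `lookup` compares keys while following the links instead of trusting the slot index alone.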
