Information Retrieval



1 Information Retrieval ETH Zürich, Fall 2012 Thomas Hofmann LECTURE 4 INDEX COMPRESSION Information Retrieval, ETHZ

2 Overview 1. Dictionary Compression 2. Zipf's Law 3. Posting List Compression 4. Gamma Codes 5. Golomb Code 6. Index Compression in Practice

3 DICTIONARY COMPRESSION

4 Vocabulary Growth: Heaps' Law Can we assume there is an upper bound on the vocabulary size? Not really: the vocabulary keeps growing with collection size. Heaps' law: $M = kT^{\beta}$, where $M$ is the size of the vocabulary and $T$ is the number of tokens in the collection. Typical values for the parameters are $30 \le k \le 100$ and $\beta \approx 0.5$. Empirical law: Heaps' law is linear in log-log space, i.e., the simplest possible relationship between collection size and vocabulary size.
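
As a rough illustration of Heaps' law, the sketch below evaluates M = k * T^β for a hypothetical collection. The function name and the values k = 50 and T = 1,000,000 are assumptions picked from the typical range quoted on the slide, not measurements of any real corpus.

```python
def heaps_vocabulary_size(num_tokens, k=50.0, beta=0.5):
    """Estimate the vocabulary size M = k * T**beta (Heaps' law)."""
    return k * num_tokens ** beta


# Hypothetical collection with one million tokens.
estimate = heaps_vocabulary_size(1_000_000)
print(round(estimate))  # 50000
```

Quadrupling the number of tokens only doubles the estimate when β = 0.5, which is why the vocabulary keeps growing but ever more slowly.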

5 Dictionary Hash Table [Figure: the terms (ETHZ, class, mountain, weather) are mapped by a fixed (known) hash function h to hash slots ..., r, r+1, ..., n; each slot points to a collision list of terms.] Storage is needed for the token strings.

6 Dictionary as a String

7 Dictionary as a String with Blocking

8 Example: Space Estimate Example block size k = 4. Where we used $4 \times 3$ bytes for term pointers without blocking, we now use 3 bytes for one pointer plus 4 bytes for indicating the length of each term. We save $12 - (3 + 4) = 5$ bytes per block. Total savings: $400{,}000/4 \times 5$ bytes $= 0.5$ MB. This reduces the size of the Reuters dictionary from 7.6 MB to 7.1 MB.
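
The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function and parameter names are hypothetical, and it assumes 3-byte term pointers and one length byte per term, as in the example.

```python
def blocking_savings_bytes(num_terms, block_size, pointer_bytes=3, length_bytes=1):
    """Bytes saved by keeping one term pointer per block instead of one per term."""
    without_blocking = block_size * pointer_bytes              # e.g. 4 * 3 = 12
    with_blocking = pointer_bytes + block_size * length_bytes  # e.g. 3 + 4 = 7
    num_blocks = num_terms // block_size
    return num_blocks * (without_blocking - with_blocking)


print(blocking_savings_bytes(400_000, 4))  # 500000 bytes, i.e. 0.5 MB
```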

9 Front Coding

10 Example: Dictionary Compression for Reuters-RCV1

11 ZIPF'S LAW

12 Zipf's Law We have characterized the growth of the vocabulary in collections with Heaps' law. We also want to know how many frequent vs. infrequent terms we should expect in a collection. In natural language, there are a few very frequent terms and very many very rare terms. Zipf's law: the $i$-th most frequent term has frequency proportional to $1/i$, i.e., $\mathrm{cf}_i \propto 1/i$, where $\mathrm{cf}_i$ is the collection frequency: the number of occurrences of term $t_i$ in the collection. Equivalent: $\mathrm{cf}_i = c \cdot i^{k}$ or $\log \mathrm{cf}_i = \log c + k \log i$ (for $k = -1$). This is an example of a power law.
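
To make the power-law form concrete, here is a minimal sketch that evaluates cf_i = c * i^k and confirms that the slope in log-log space equals k = -1. The function names and the constant c are illustrative assumptions, not values fitted to any collection.

```python
import math


def zipf_frequency(rank, c, k=-1.0):
    """Collection frequency predicted by Zipf's law: cf_i = c * i**k."""
    return c * rank ** k


def loglog_slope(rank_a, rank_b, c):
    """Slope of log(cf) against log(rank) between two ranks."""
    dy = math.log(zipf_frequency(rank_b, c)) - math.log(zipf_frequency(rank_a, c))
    dx = math.log(rank_b) - math.log(rank_a)
    return dy / dx


print(zipf_frequency(2, c=1000))  # 500.0: the 2nd term is half as frequent as the 1st
```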

13 Example: Zipf's Law The fit is not perfect for Reuters-RCV1. What is important is the key insight: few frequent terms, many rare terms.

14 POSTING LIST COMPRESSION

15 Posting List Compression The postings file is much larger than the dictionary, by a factor of at least 10. Key desideratum: store each posting compactly. A posting for our purposes is a doc-id. For Reuters (800,000 documents), we would use 32 bits per doc-id when using 4-byte integers. Alternatively, we can use $\log_2 800{,}000 \approx 20$ bits per doc-id. Our goal: use a lot less than 20 bits per doc-id.
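
The 20-bit figure can be reproduced as follows. This is a small sketch with a hypothetical helper name; it assumes doc-ids are numbered from 0 to num_docs - 1 and stored at a fixed width.

```python
def bits_per_doc_id(num_docs):
    """Smallest fixed width in bits that can represent doc-ids 0 .. num_docs - 1."""
    return max(1, (num_docs - 1).bit_length())


print(bits_per_doc_id(800_000))  # 20
```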

16 Gap Encoding of Doc-IDs Each postings list is ordered in increasing order of doc-id. Example postings list: computer: ..., ..., ..., ... It suffices to store the gaps between consecutive doc-ids, here 5 and 43. Example postings list using gaps: computer: ..., 5, 43, ... Gaps for frequent terms are small. Thus, we can encode small gaps with fewer than 20 bits.
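
A minimal sketch of gap encoding and its inverse is shown below. The doc-ids in the example are made up for illustration; they were chosen only so that the gaps come out as 5 and 43, and the function names are hypothetical.

```python
def to_gaps(doc_ids):
    """Turn a sorted list of doc-ids into the first doc-id followed by gaps."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Rebuild the doc-ids by accumulating the gaps."""
    doc_ids = []
    total = 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


# Hypothetical doc-ids, not taken from the Reuters collection.
print(to_gaps([1000, 1005, 1048]))  # [1000, 5, 43]
```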

17 VARIABLE LENGTH ENCODING

18 Variable Length Encoding Aim: For "arachnocentric" and other rare terms, we will use about 20 bits per gap (= posting). For "the" and other very frequent terms, we will use about 1 bit per gap (= posting). In order to implement this, we need to devise some form of variable length encoding: use few bits for small gaps, many bits for large gaps.

19 Variable Byte Code Used by many commercial/research systems. A good low-tech blend of variable-length coding and sensitivity to byte alignment (as opposed to the bit-level codes covered later). Dedicate 1 bit (the high bit) of each byte to be a continuation bit c. If the gap G fits within 7 bits, binary-encode it in the 7 available bits and set c = 1. Else: encode the higher-order bits first (padding the leading byte with zeros) and then use one or more additional bytes to encode the lower-order bits, 7 bits per byte. At the end, set the continuation bit of the last byte to 1 (c = 1) and of all other bytes to 0 (c = 0).
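
The scheme described on this slide can be sketched as follows. The function names are hypothetical; the byte layout follows the slide's convention (higher-order 7-bit groups first, continuation bit set to 1 only on the last byte of each gap).

```python
def vb_encode_number(gap):
    """Encode one non-negative gap as variable-byte code (last byte has the high bit set)."""
    chunks = []
    while True:
        chunks.insert(0, gap % 128)  # prepend the next lower-order 7-bit group
        if gap < 128:
            break
        gap //= 128
    chunks[-1] += 128  # set c = 1 on the final byte
    return bytes(chunks)


def vb_decode(stream):
    """Decode a concatenation of variable-byte codes into a list of gaps."""
    gaps = []
    value = 0
    for byte in stream:
        if byte < 128:          # c = 0: more bytes follow
            value = 128 * value + byte
        else:                   # c = 1: last byte of this gap
            gaps.append(128 * value + (byte - 128))
            value = 0
    return gaps


encoded = b"".join(vb_encode_number(g) for g in [824, 5, 214577])
print(vb_decode(encoded))  # [824, 5, 214577]
```

A gap of 5 takes one byte and a gap of 824 takes two, so small gaps cost 8 bits rather than a fixed 20 or 32.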

20 Variable Byte Code: Example

21 GAMMA CODES

22 Gamma Codes Even better compression is possible with a bit-level code; the gamma code is the best known of these. Represent a gap G as a pair of length and offset. Offset is the gap in binary, with the leading bit chopped off. For example, 13 → 1101 → 101. Length is the length of offset. For 13 (offset 101), this is 3. Encode length in unary code: 1110. The gamma code of 13 is the concatenation of length and offset: 1110101.
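
The construction above can be sketched in a few lines. Codes are returned as strings of '0' and '1' for readability rather than packed bits, and the function names are hypothetical. The unary convention (n ones followed by a zero) is the one used in this lecture.

```python
def unary(n):
    """Unary code of n: n ones followed by a terminating zero."""
    return "1" * n + "0"


def gamma_encode(gap):
    """Gamma code of a gap >= 1 as a bit string: unary(length of offset) + offset."""
    if gap < 1:
        raise ValueError("gamma codes are defined for gaps >= 1")
    offset = bin(gap)[3:]  # binary representation without '0b' and the leading 1 bit
    return unary(len(offset)) + offset


print(gamma_encode(13))  # 1110101
```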

23 Unary Code Represent n as n 1s with a final 0. Unary code for 3 is 1110. Unary code for 40 is forty 1s followed by a single 0.

24 Gamma Code: Examples

25 Length of Gamma Code The length of offset is $\lfloor \log_2 G \rfloor$ bits. The length of length is $\lfloor \log_2 G \rfloor + 1$ bits. So the length of the entire code is $2 \lfloor \log_2 G \rfloor + 1$ bits. Gamma codes are always of odd length. Gamma codes are within a factor of 2 of the optimal encoding length $\log_2 G$. This assumes that all gaps are equally probable, but the gap distribution is actually highly skewed.
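
The length formula 2 * floor(log2 G) + 1 can be evaluated without floating point, since floor(log2 G) equals the bit length of G minus one. The helper name below is an assumption made for this sketch.

```python
def gamma_code_length(gap):
    """Number of bits in the gamma code of a gap >= 1: 2 * floor(log2(gap)) + 1."""
    floor_log2 = gap.bit_length() - 1
    return 2 * floor_log2 + 1


for gap in (1, 13, 1025):
    print(gap, gamma_code_length(gap))  # 1 -> 1, 13 -> 7, 1025 -> 21
```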

26 Gamma Codes: Alignment Machines have word boundaries: 8, 16, 32 bits. Compressing and manipulating at individual bit granularity can slow down query processing. Variable byte alignment is potentially more efficient. Regardless of efficiency, variable byte is conceptually simpler at little additional space cost.

27 Gamma Code: Encode Taken from en.wikipedia.org/wiki/elias_gamma_coding

28 Gamma Code: Decode Taken from en.wikipedia.org/wiki/elias_gamma_coding
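
As a sketch of the decoding direction, the function below reads a concatenation of gamma codes from a bit string. It follows this lecture's convention (unary length as 1s terminated by a 0, then the offset); the cited article may use a different bit convention. The function name is hypothetical and the input is assumed to be well formed.

```python
def gamma_decode_stream(bits):
    """Decode a string of concatenated gamma codes into a list of gaps."""
    gaps = []
    pos = 0
    while pos < len(bits):
        length = 0
        while bits[pos] == "1":   # read the unary-coded length
            length += 1
            pos += 1
        pos += 1                  # skip the terminating 0 of the unary part
        offset = bits[pos:pos + length]
        pos += length
        gaps.append(int("1" + offset, 2))  # restore the chopped-off leading 1
    return gaps


print(gamma_decode_stream("11101010100"))  # [13, 1, 2]
```

No separator is needed between codes because the unary prefix tells the decoder exactly how many offset bits follow.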

29 GOLOMB CODE

30 Shannon Limit Is it possible to derive codes that are optimal (under certain assumptions)? What is the optimal average code length for a code that encodes each integer (gap length) independently? Lower bound on average code length: the Shannon entropy. Asymptotically optimal codes (finite alphabets): arithmetic coding, Huffman codes.
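
The lower bound can be computed directly for any given gap distribution. The sketch below uses a made-up three-symbol distribution purely for illustration; the function name is an assumption.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over outcomes with p > 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical distribution over three gap values.
print(shannon_entropy([0.5, 0.25, 0.25]))  # 1.5
```

No code that encodes each gap independently can use fewer than 1.5 bits per gap on average for this distribution.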

31 Bernoulli Model Assumption: term occurrences are Bernoulli events. Notation: $n$: number of documents; $m$: number of terms in the vocabulary; $N$: total number of (unique) term-document occurrences. Probability of term $t_j$ occurring in document $d_i$: $p = N/(nm)$. Each term-document occurrence is an independent event. The probability of a gap of length $x$ is given by the geometric distribution: $P(x) = (1-p)^{x-1} p$.
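
Under the Bernoulli assumption, a gap of length x means x - 1 documents without the term followed by one document with it, which gives the standard geometric distribution P(x) = (1 - p)^(x - 1) * p. The sketch below evaluates it; the function name and the value p = 0.1 are illustrative assumptions.

```python
def gap_probability(x, p):
    """Geometric probability of a gap of length x >= 1 with occurrence probability p."""
    return (1 - p) ** (x - 1) * p


# With p = 0.1 the probabilities over all gap lengths sum to (almost exactly) 1.
total = sum(gap_probability(x, 0.1) for x in range(1, 1001))
print(round(total, 6))  # 1.0
```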

32 Golomb Code

33 Golomb Code

34 Local Bernoulli Model If the length of each posting list is known, then a Bernoulli model on each individual inverted list can be used. Frequent words are coded with smaller b, infrequent words with larger b. The term frequency needs to be encoded as well (use a gamma code). The local Bernoulli model outperforms the global Bernoulli model in practice (method of choice!).
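
The following is a minimal sketch of how a per-list parameter b might be chosen and used. It relies on the common textbook formulation of the Golomb code (quotient in unary, remainder in truncated binary), which may differ in detail from the definition given in the lecture. The rule b ≈ 0.69 / p is a commonly cited approximation and is an assumption here, as are all function names.

```python
def golomb_parameter(num_docs, list_length):
    """Rule-of-thumb parameter for one posting list: b ~ 0.69 / p, p = list_length / num_docs."""
    p = list_length / num_docs
    return max(1, round(0.69 / p))


def golomb_encode(gap, b):
    """Golomb code of a gap >= 1 with parameter b >= 1, returned as a bit string."""
    if gap < 1 or b < 1:
        raise ValueError("gap and b must both be >= 1")
    quotient, remainder = divmod(gap - 1, b)
    code = "1" * quotient + "0"      # quotient in unary (ones terminated by a zero)
    k = (b - 1).bit_length()         # ceil(log2(b))
    if k == 0:                       # b == 1: no remainder bits are needed
        return code
    cutoff = (1 << k) - b            # number of remainders that get the short form
    if remainder < cutoff:
        return code + format(remainder, "0{}b".format(k - 1))
    return code + format(remainder + cutoff, "0{}b".format(k))


print([golomb_encode(g, 3) for g in (1, 2, 3, 4)])  # ['00', '010', '011', '100']
```

With this rule, a term occurring in half of all documents gets b = 1 (a pure unary code), while a rare term gets a large b, in line with the slide's remark about frequent and infrequent words.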

35 Compression of Reuters: Summary

36 INDEX COMPRESSION IN PRACTICE

37 Block-Based Index Format Block-based, variable-length format to reduce space and CPU. Reduced index size by ~30%, plus much faster to decode.

38 CPU Optimized Compression Block index format: very good compression, but CPU-intensive to decode. Better format: a single flat position space. Data structures on the side keep track of doc boundaries. Posting lists are just lists of delta-encoded positions. They need to be compact (can't afford a 32-bit value per occurrence) but also very fast to decode.

39 Improved Byte-Aligned Variable-Length Encodings Varint encoding: 7 bits per byte with a continuation bit. Con: decoding requires lots of branches/shifts/masks. Idea: encode the byte length in the low 2 bits. Better: fewer branches, shifts, and masks. Con: limited to 30-bit values, and still some shifting to decode.
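
One way to realize the "byte length in the low 2 bits" idea is sketched below. The exact layout (value shifted left by 2, length minus one in the low bits, little-endian byte order) is an assumption made for illustration, as are the function names; the slide only states the idea and the 30-bit limit.

```python
def encode_length_prefixed(value):
    """Encode a value < 2**30 in 1-4 bytes; the low 2 bits of byte 0 hold (byte count - 1)."""
    if not 0 <= value < 1 << 30:
        raise ValueError("value must fit in 30 bits")
    num_bytes = 1
    while value >= 1 << (8 * num_bytes - 2):  # 6, 14, 22 or 30 payload bits
        num_bytes += 1
    word = (value << 2) | (num_bytes - 1)
    return word.to_bytes(num_bytes, "little")


def decode_length_prefixed(buf, pos=0):
    """Decode one value starting at pos; return (value, position of the next value)."""
    num_bytes = (buf[pos] & 0b11) + 1         # a single mask gives the length
    word = int.from_bytes(buf[pos:pos + num_bytes], "little")
    return word >> 2, pos + num_bytes


data = b"".join(encode_length_prefixed(v) for v in [1, 511, 131071])
print(decode_length_prefixed(data))  # (1, 1)
```

The decoder learns the length from the first byte alone, so it does not have to test a continuation bit on every byte.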

40 Group Varint Encoding Idea: encode groups of 4 values in 5-17 bytes. Pull out the four 2-bit binary lengths into a single prefix byte. Decode: load the prefix byte and look up the value in a 256-entry table.
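
A minimal sketch of group varint encoding follows. The placement of the four 2-bit lengths within the prefix byte (first value in the highest bits) and the little-endian byte order are assumptions, and the names are hypothetical. The 256-entry table maps each possible prefix byte to the four byte lengths it announces.

```python
# LENGTH_TABLE[prefix] gives the byte lengths (1-4) of the four values in a group.
LENGTH_TABLE = [
    tuple(((prefix >> shift) & 0b11) + 1 for shift in (6, 4, 2, 0))
    for prefix in range(256)
]


def group_varint_encode(values):
    """Encode exactly four values < 2**32 as one prefix byte plus 4-16 data bytes."""
    if len(values) != 4:
        raise ValueError("group varint encodes exactly four values at a time")
    prefix = 0
    body = bytearray()
    for value in values:
        if not 0 <= value < 1 << 32:
            raise ValueError("each value must fit in 32 bits")
        num_bytes = max(1, (value.bit_length() + 7) // 8)
        prefix = (prefix << 2) | (num_bytes - 1)
        body += value.to_bytes(num_bytes, "little")
    return bytes([prefix]) + bytes(body)


def group_varint_decode(buf, pos=0):
    """Decode one group starting at pos; return (four values, position of the next group)."""
    lengths = LENGTH_TABLE[buf[pos]]  # one table lookup replaces per-byte branching
    pos += 1
    values = []
    for num_bytes in lengths:
        values.append(int.from_bytes(buf[pos:pos + num_bytes], "little"))
        pos += num_bytes
    return values, pos


encoded = group_varint_encode([1, 15, 511, 131071])
print(len(encoded), group_varint_decode(encoded)[0])  # 8 [1, 15, 511, 131071]
```

Four one-byte values take 5 bytes in total and four four-byte values take 17, matching the 5-17 byte range on the slide.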
