Postings Lists - Reminder

Introduction to Search Engine Technology: Index Compression
Ronny Lempel, Yahoo! Labs, Haifa
(Some of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa Research Lab)

Postings Lists - Reminder
- The lexicon entry corresponding to term t points to t's postings list and also holds t's document frequency (DF)
- Logically, t's postings list is a list of posting elements corresponding to t's occurrences
  - Each posting element contains a document identifier along with the offsets of t's occurrences within the document
  - Elements are sorted by increasing document identifier
- Formally, for a term appearing in $n_t$ documents $x_1, x_2, \ldots, x_{n_t}$:
  $[(x_1, f_1, \langle o_1, \ldots, o_{f_1} \rangle), (x_2, f_2, \langle o_1, \ldots, o_{f_2} \rangle), \ldots, (x_{n_t}, f_{n_t}, \langle o_1, \ldots, o_{f_{n_t}} \rangle)]$
  where $x_i < x_{i+1}$ and $o_j < o_{j+1}$
- Efficient skipping mechanisms exist that enable reaching a position in a postings list without streaming through its prefix

23 November, Search Engine Technology

Compression of Postings Lists
- Smaller, more compact postings lists mean less I/O, or that larger indices can fit in RAM
- Key idea: since the doc-ids associated with each term t are in ascending order, encode each doc-id by its gap from the previous identifier
  - This encoding is called d-gap encoding
- Example: transform the list
  $[(x_1, f_1, \langle o_1, \ldots, o_{f_1} \rangle), (x_2, f_2, \langle o_1, \ldots, o_{f_2} \rangle), \ldots, (x_{n_t}, f_{n_t}, \langle o_1, \ldots, o_{f_{n_t}} \rangle)]$
  into
  $[(x_1, f_1, \langle o_1, \ldots, o_{f_1} \rangle), (x_2 - x_1, f_2, \langle o_1, \ldots, o_{f_2} \rangle), \ldots, (x_{n_t} - x_{n_t - 1}, f_{n_t}, \langle o_1, \ldots, o_{f_{n_t}} \rangle)]$
- Note: the sequence of occurrence offsets within documents can also be encoded in a similar fashion

Why Use d-gaps?
- No information is lost, but where is the saving?
  - The largest d-gap in the second representation is potentially of the same order of magnitude as the largest document id in the first representation
  - If the index holds N documents and a fixed binary encoding is used, both methods require $\log_2 N$ bits per doc-id/d-gap
- However, frequent terms have d-gaps that are significantly smaller than the document identifiers in which they occur
- Consequently: use variable-length encoding schemes, in which small and/or frequent d-gap values are encoded in fewer than $\log_2 N$ bits
- The optimal choice of encoding scheme depends on the probability distribution of the d-gaps and on decoding speeds
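The d-gap transformation can be sketched in a few lines of Python. This is only an illustration: the function names `to_dgaps` and `from_dgaps` and the sample doc-ids are made up here, and the frequencies and offsets of each posting element are left out for brevity.

```python
from itertools import accumulate


def to_dgaps(doc_ids):
    """Keep the first doc-id and replace every later one by its gap from the previous id."""
    if any(b <= a for a, b in zip(doc_ids, doc_ids[1:])):
        raise ValueError("doc-ids must be strictly increasing")
    return list(doc_ids[:1]) + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def from_dgaps(gaps):
    """Invert to_dgaps with a running sum."""
    return list(accumulate(gaps))


doc_ids = [3, 7, 8, 21, 150]   # hypothetical postings of one term
gaps = to_dgaps(doc_ids)       # [3, 4, 1, 13, 129]
restored = from_dgaps(gaps)    # the original doc-ids again
```

Most of the gaps are much smaller than the doc-ids they replace, which is what the variable-length codes on the following slides exploit.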

Integer Representations: Fixed Length vs. Variable Length
- Fixed length: $\log_2 N$ bits per integer, where N is the maximal possible integer
  - Imposes a limit on the number of bits used per integer
  - Does not compress: does not exploit the differences in the relative frequencies of the integers
- Variable-length, prefix-free encoding allows unbounded number representation with significant space savings
  - Single-gap variable-length representations: Vint, Huffman, unary, γ, δ, Golomb encodings
  - Multiple-gap variable-length representations: Group Varint, Simple9, PForDelta
    - Simple9 doesn't support unbounded numbers, but is practical enough
- How much space saving is possible?

First Example - Vint
- A byte-aligned family of schemes, with each integer encoded by a variable number of bytes
- Simplest form - chaining: the leading bit of each byte indicates whether the number continues in an additional byte
  - E.g. a 10-bit number takes two bytes, since each byte carries 7 payload bits
- Alternatively, if the maximal integer is bounded, the leading bits of the leading byte can encode the number of additional bytes needed for the given number
  - Example layout for bounded integers:
    Integers up to $2^6 - 1$:    00xxxxxx
    Integers up to $2^{14} - 1$: 01xxxxxx xxxxxxxx
    Integers up to $2^{22} - 1$: 10xxxxxx xxxxxxxx xxxxxxxx
- Other variants exist, all easily decodable
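A minimal sketch of the chaining form of Vint follows. The slide only says that the leading bit of each byte signals continuation; the choices made here (a set leading bit means "more bytes follow", and the least significant 7-bit group is written first) are assumptions for illustration, as are the function names.

```python
def vint_encode(n):
    """Encode a non-negative integer, 7 payload bits per byte, low-order group first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        payload = n & 0x7F
        n >>= 7
        if n:
            out.append(0x80 | payload)  # leading bit set: another byte follows
        else:
            out.append(payload)         # leading bit clear: last byte of this number
            return bytes(out)


def vint_decode_stream(data):
    """Decode a concatenation of Vint codewords into a list of integers."""
    values = []
    n = shift = 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(n)
            n = shift = 0
    return values


stream = b"".join(vint_encode(g) for g in [3, 4, 1, 13, 1000])
decoded = vint_decode_stream(stream)
```

Here the first four values take one byte each and 1000 (a 10-bit number) takes two, so the stream is 6 bytes long.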

Group Varint
- Used by Google [Jeff Dean, Keynote at WSDM 2009]
- Encodes 4 integers in blocks of 5-17 bytes
- First byte: four 2-bit binary length fields $L_1 L_2 L_3 L_4$, $L_j \in \{1,2,3,4\}$
- Then $L_1 + L_2 + L_3 + L_4$ bytes (between 4 and 16) holding the 4 numbers
  - Each number can use 8/16/24/32 bits
- Reported to be about twice as fast to decode as (single) Vint schemes

Prefix-Free Coding of Integers
- Let $\mathbb{N}$ be the set of natural numbers and Σ the alphabet of the code
  - In our case, Σ = {0,1}
- $C: \mathbb{N} \to \Sigma^+$ is a prefix-free code if for any two distinct natural numbers i, j, C(i) is not a prefix of C(j)
- Significance of prefix-free coding: codewords can be concatenated to each other without the need for any delimiters, and the resulting sequence remains uniquely decipherable
  - Sometimes called comma-free codes
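The Group Varint block layout can be sketched as below. The slide fixes the structure (one header byte with four 2-bit length fields, then the four numbers), but not the details assumed here: each field stores the length minus one, the first number's field occupies the highest two bits, and the numbers are written little-endian. The function names are hypothetical.

```python
def group_varint_encode(a, b, c, d):
    """Pack four integers below 2**32 into one block of 5 to 17 bytes."""
    header = 0
    body = bytearray()
    for n in (a, b, c, d):
        if not 0 <= n < 2**32:
            raise ValueError("each integer must fit in 32 bits")
        length = max(1, (n.bit_length() + 7) // 8)  # 1..4 bytes
        header = (header << 2) | (length - 1)       # 2-bit length field
        body += n.to_bytes(length, "little")
    return bytes([header]) + bytes(body)


def group_varint_decode(block):
    """Unpack the four integers of one block produced by group_varint_encode."""
    header = block[0]
    pos = 1
    values = []
    for shift in (6, 4, 2, 0):
        length = ((header >> shift) & 0b11) + 1
        values.append(int.from_bytes(block[pos:pos + length], "little"))
        pos += length
    return values


block = group_varint_encode(1, 300, 70000, 2**31)
numbers = group_varint_decode(block)
```

All four lengths are known after reading a single byte, so the decoder needs no per-byte continuation test; this is one plausible reason for the speed advantage the slide reports.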

Shannon's Lossless Source Coding Theorem (simplified version)
- Let $C: \mathbb{N} \to \{0,1\}^+$ be a binary encoding of the set of natural numbers, and let $P: \mathbb{N} \to [0,1]$ be a probability distribution on the set of natural numbers
- Denote by $b_i$ the number of bits in C(i), i.e. the length of i's encoding
- The expected length (in bits) of a codeword of C is thus $R(C) = \sum_{i>0} p(i)\, b_i$
  - R(C) is also called the rate of the code C
- Shannon:
  1. For any code C, $R(C) \ge -\sum_{i>0} p(i) \log_2 p(i)$
  2. An optimal code C* will achieve $R(C^*) < 1 - \sum_{i>0} p(i) \log_2 p(i)$
- The quantity $-\sum_{i>0} p(i) \log_2 p(i)$ is called the entropy of P and is denoted by H(P)

Unary Representation
- Perhaps the simplest prefix-free variable-length representation of positive integers
- X is represented by X-1 1's followed by a single 0
  - E.g. 1 → 0, 3 → 110
- By Shannon, optimal for the distribution $\Pr(x) = 2^{-x}$
  - Since R(unary representation) = $H(\Pr(x) = 2^{-x})$
- The total length of the gap representations in a postings list equals the ordinal number of the last document that includes the term
- Beats fixed-length encodings for terms that appear in more than N/log(N) documents
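The unary code and two of the claims about it can be checked numerically with a short sketch. The function names and the sample gaps are illustrative, and the infinite sums for the rate and the entropy are truncated at x = 59, which is an approximation.

```python
import math


def unary_encode(x):
    """x >= 1 becomes (x - 1) ones followed by a single zero."""
    if x < 1:
        raise ValueError("unary coding is defined for positive integers only")
    return "1" * (x - 1) + "0"


def unary_decode(bits):
    """Decode a concatenation of unary codewords."""
    values, run = [], 0
    for bit in bits:
        if bit == "1":
            run += 1
        else:
            values.append(run + 1)
            run = 0
    return values


# Total length of the unary-coded gaps equals the last doc-id.
gaps = [3, 4, 1, 13]                       # hypothetical d-gaps of doc-ids 3, 7, 8, 21
encoded = "".join(unary_encode(g) for g in gaps)
total_bits = len(encoded)                  # 21, the last doc-id

# For Pr(x) = 2^-x the rate of the unary code matches the entropy (truncated sums).
probs = {x: 2.0 ** -x for x in range(1, 60)}
rate = sum(p * len(unary_encode(x)) for x, p in probs.items())
entropy = -sum(p * math.log2(p) for p in probs.values())
```

Both `rate` and `entropy` come out at approximately 2 bits, since each term of either sum is $x \cdot 2^{-x}$.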

γ (Gamma) Coding
- Factor any x > 0 into $2^e + d$, where $e = \lfloor \log_2 x \rfloor$ and $0 \le d < 2^e$
- Represent e+1 in unary
- Represent d in binary, using e bits
  - E.g. $9 = 2^3 + 1$ → 1110:001
- Representation length: $2 \lfloor \log_2 x \rfloor + 1$
- Optimal for $1/(2x^2) < \Pr(x) \le 1/x^2$
  - OK for a probability distribution?

  Integers   Gamma code     Bits
  1          0              1
  2-3        10x            3
  4-7        110xx          5
  8-15       1110xxx        7
  16-31      11110xxxx      9
  32-63      111110xxxxx    11

Generalization: the δ (Delta) Code
- Factor x > 0 into $2^e + d$, where $e = \lfloor \log_2 x \rfloor$ and $0 \le d < 2^e$
- Represent e+1 in γ
- Represent d in binary, using e bits
- Detailed example: δ encoding of 9
  - $9 = 2^3 + 1$, i.e. e=3 and d=1
  - 3+1 (e+1) in gamma is 110:00
  - 1 (d) in 3-bit representation is 001
  - Altogether: 110:00:001
- Length = $1 + \lfloor \log_2 x \rfloor + 2 \lfloor \log_2 \log_2 2x \rfloor$
- Optimal for $P(x) \approx 1/(2x (\log_2 2x)^2)$

  Integers   Delta code     Bits
  1          0              1
  2-3        100x           4
  4-7        101xx          5
  8-15       11000xxx       8
  16-31      11001xxxx      9
  32-63      11010xxxxx     10
  64-127     11011xxxxxx    11
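The γ and δ codes can be written down almost directly from their definitions. The sketch below uses strings of '0' and '1' rather than packed bits for readability; the helper and function names are illustrative, and the unary convention is the one from the Unary Representation slide (X-1 ones, then a zero).

```python
def _unary(x):
    return "1" * (x - 1) + "0"


def _binary(d, width):
    """d written in exactly `width` bits (empty string when width is 0)."""
    return format(d, "b").zfill(width) if width else ""


def gamma_encode(x):
    if x < 1:
        raise ValueError("gamma coding is defined for positive integers only")
    e = x.bit_length() - 1          # e = floor(log2 x)
    d = x - (1 << e)                # 0 <= d < 2^e
    return _unary(e + 1) + _binary(d, e)


def delta_encode(x):
    if x < 1:
        raise ValueError("delta coding is defined for positive integers only")
    e = x.bit_length() - 1
    d = x - (1 << e)
    return gamma_encode(e + 1) + _binary(d, e)


def gamma_decode(bits):
    """Decode a concatenation of gamma codewords; no delimiters are needed."""
    values, i = [], 0
    while i < len(bits):
        e = 0
        while bits[i] == "1":       # count the ones of the unary part
            e += 1
            i += 1
        i += 1                      # skip the terminating zero
        d = int(bits[i:i + e], 2) if e else 0
        i += e
        values.append((1 << e) + d)
    return values


g9 = gamma_encode(9)                # "1110001"
d9 = delta_encode(9)                # "11000001"
roundtrip = gamma_decode(gamma_encode(9) + gamma_encode(1) + gamma_encode(4))
```

The round trip over three concatenated codewords illustrates the prefix-free property: the decoder finds the codeword boundaries on its own.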

Encoding Lengths Comparison

  Number      Unary (bits)   Gamma (bits)   Delta (bits)
  1,000,000   1,000,000      39             28

- A fixed-length representation would require at least 20 bits per integer to encode a range of 1M

Golomb-Rice Codes
- Golomb codes are a parametric family of prefix codes that are very easy to implement
- They are distinguished from each other by a single parameter m
  - The optimal choice of m depends on the probability distribution of the input sequence
- Rice coding is a special case of Golomb coding with m being a power of 2
  - Operations can then be done by masking and shifting bits
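The codeword lengths in the comparison follow from the length formulas of the unary, γ and δ codes, so they can be recomputed for any set of numbers. In the sketch below the sample numbers are an arbitrary choice, not the rows of the original slide, and the function names are illustrative.

```python
def unary_len(x):
    return x


def gamma_len(x):
    return 2 * (x.bit_length() - 1) + 1       # 2*floor(log2 x) + 1


def delta_len(x):
    e = x.bit_length() - 1                    # floor(log2 x)
    return gamma_len(e + 1) + e               # gamma code of e+1, then e bits for d


def fixed_len(max_value):
    """Bits needed by a fixed-length code for values 0..max_value-1."""
    return max(1, (max_value - 1).bit_length())


rows = [(x, unary_len(x), gamma_len(x), delta_len(x))
        for x in (1, 2, 10, 100, 1000, 1_000_000)]   # arbitrary sample numbers

print(f"{'Number':>9} {'Unary':>9} {'Gamma':>6} {'Delta':>6}")
for x, u, g, d in rows:
    print(f"{x:>9} {u:>9} {g:>6} {d:>6}")
```

For small numbers all three variable-length codes beat the 20-bit fixed-length code, while for values near one million γ and δ cost more than 20 bits; the saving therefore depends on small gaps being frequent.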

Golomb-Rice Coding (cont.)
- To encode an integer n using the Golomb code with parameter $m = 2^k$:
  - Write n as r*m + d, where $r = \lfloor n/m \rfloor$ (the quotient) and $0 \le d < m$ is the remainder
  - Represent r+1 in unary (since 0 doesn't have a unary encoding)
  - Represent the remainder (n mod m) in binary using k bits

  Integer   m=4       # bits   m=8      # bits
  0-3       0xx       3        0xxx     4
  4-7       10xx      4        0xxx     4
  8-11      110xx     5        10xxx    5
  12-15     1110xx    6        10xxx    5
  16-19     11110xx   7        110xxx   6

Matching Code to Distribution
- Unary coding is optimal when $\Pr(x) = 2^{-x}$
- Gamma is optimal when $\Pr(x) \approx 1/(2x^2)$
- Delta is optimal when $\Pr(x) \approx 1/(2x (\log_2 2x)^2)$
- Golomb-Rice is optimal when $\Pr(x) = (1-p)^{x-1} p$, i.e. for geometric distributions
  - Provided that m is chosen such that $(1-p)^m + (1-p)^{m+1} \le 1 < (1-p)^m + (1-p)^{m-1}$
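A short sketch of Rice coding (Golomb with $m = 2^k$) and of the parameter condition follows. Bit strings stand in for packed bits, the function names are illustrative, and `golomb_parameter` simply searches for the smallest m satisfying the stated inequality; that m is the Golomb parameter and is not necessarily a power of 2.

```python
def rice_encode(n, k):
    """Golomb code of n >= 0 with m = 2**k: r+1 in unary, then the remainder in k bits."""
    if n < 0 or k < 0:
        raise ValueError("n and k must be non-negative")
    r = n >> k                       # quotient, by shifting
    d = n & ((1 << k) - 1)           # remainder, by masking
    remainder = format(d, "b").zfill(k) if k else ""
    return "1" * r + "0" + remainder


def rice_decode(bits, k):
    """Decode a concatenation of Rice codewords that share the parameter k."""
    values, i = [], 0
    while i < len(bits):
        r = 0
        while bits[i] == "1":
            r += 1
            i += 1
        i += 1                       # skip the zero that ends the unary part
        d = int(bits[i:i + k], 2) if k else 0
        i += k
        values.append((r << k) + d)
    return values


def golomb_parameter(p):
    """Smallest m with (1-p)^m + (1-p)^(m+1) <= 1, for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    q = 1.0 - p
    m = 1
    while q ** m + q ** (m + 1) > 1.0:
        m += 1
    return m


code_m4 = rice_encode(13, 2)         # "111001", 6 bits (row 12-15, m=4)
code_m8 = rice_encode(13, 3)         # "10101", 5 bits (row 12-15, m=8)
```

Because m-1 fails the first inequality whenever m is the smallest value that satisfies it, the returned m also meets the right-hand side of the condition.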

Golomb-Rice vs. the Rest
- For a word that appears in a fraction p of the documents, assume that each document receives the word independently with probability p (i.e. a Bernoulli process)
  - This is an approximation, but a reasonable one in most cases
- Consequently, the d-gaps are distributed geometrically with parameter p, and Golomb-Rice encoding is optimal
- Different parameters can be used for different postings lists, based on the DF of each term that is stored in the lexicon

Simple9 Encoding Scheme [Anh & Moffat, 2004]
- A word-aligned, multiple-number encoding scheme
- Encoding block: 4 bytes (32 bits)
- The most significant nibble (4 bits) describes the layout of the other 28 bits, which hold n numbers of b bits each with $n \cdot b \le 28$:
  0: a single 28-bit number
  1: two 14-bit numbers
  2: three 9-bit numbers (and one spare bit)
  3: four 7-bit numbers
  4: five 5-bit numbers (and three spare bits)
  5: seven 4-bit numbers
  6: nine 3-bit numbers (and one spare bit)
  7: fourteen 2-bit numbers
  8: twenty-eight 1-bit numbers
- Simple16 is a variant that defines 5 additional (uneven) configurations
- Can be decoded efficiently using bit masks
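The Simple9 layouts can be turned into a small packer as sketched below. Several details are assumptions made for illustration and are not taken from the slide: the encoder is greedy (it tries the densest layout first), numbers are placed from the low bits of the word upward, and a layout is used only when enough numbers remain to fill it completely.

```python
# (count, bits per number) for selectors 0..8, as listed on the slide
LAYOUTS = [(1, 28), (2, 14), (3, 9), (4, 7), (5, 5), (7, 4), (9, 3), (14, 2), (28, 1)]


def simple9_encode(numbers):
    """Greedily pack non-negative integers below 2**28 into 32-bit words."""
    words, i = [], 0
    while i < len(numbers):
        for selector in range(len(LAYOUTS) - 1, -1, -1):   # densest layout first
            n, b = LAYOUTS[selector]
            chunk = numbers[i:i + n]
            if len(chunk) == n and all(0 <= v < (1 << b) for v in chunk):
                word = selector << 28                       # layout nibble on top
                for slot, v in enumerate(chunk):
                    word |= v << (slot * b)
                words.append(word)
                i += n
                break
        else:
            raise ValueError("value is negative or does not fit in 28 bits")
    return words


def simple9_decode(words):
    """Unpack words produced by simple9_encode using shifts and bit masks."""
    out = []
    for word in words:
        n, b = LAYOUTS[word >> 28]
        mask = (1 << b) - 1
        out.extend((word >> (slot * b)) & mask for slot in range(n))
    return out


gaps = [1] * 28 + [5, 6, 7, 8, 9, 10, 11] + [100000]   # hypothetical d-gaps
words = simple9_encode(gaps)                            # 3 words for 36 numbers
```

In this example 36 gaps occupy 12 bytes, using selectors 8, 5 and 0 in turn.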

PForDelta [S. Heman, 2005]
- Encode a block of B integers together (e.g. B=128)
- Determine a percentage threshold x, such that x% of the B integers fit in k bits (e.g. x=90)
- Allocate an array of kB bits, and write any integer that fits in k bits in its corresponding slot; the minority of integers that don't fit in k bits are called exceptions
- Encode the locations of the exceptions by chaining, using their unused k-bit slots in the array
  - The index of the first exception is encoded before the array in log B bits
  - The gap to the next exception is encoded in k bits; if it doesn't fit in k bits, force an additional exception
- Encode the exceptions somehow after the log B + kB bits

Practical Considerations
- Most search engines are believed to be using byte-aligned compression schemes
  - While this favors Vint/Group Varint and Simple9, any of the other methods can also be byte-aligned by adding padding zeros
- When using d-gap compression on postings lists that support efficient skipping, each possible landing point of a skip (e.g. each block in a B+ Tree) must start with an absolute docid rather than a d-gap from the previous postings element
- Offsets (locations) within documents are also encoded
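The exception chaining of PForDelta can be illustrated with the simplified model below. It is a sketch under several assumptions, not the published algorithm in full: slots are kept as a Python list instead of a packed kB-bit array, the exceptions are kept as a plain list (the slide leaves their encoding open), each exception slot stores the distance to the next exception minus one, and an exception is forced every $2^k$ positions after the first exception even when no further real exception follows.

```python
def pfor_encode(block, k):
    """Return (index of first exception, k-bit slot values, exception values)."""
    limit = 1 << k
    slots = list(block)
    exceptions = []
    first = len(block)               # sentinel: no exception in this block
    last = None                      # position of the most recent exception
    for pos, value in enumerate(block):
        forced = last is not None and pos - last == limit   # gap would overflow k bits
        if value >= limit or forced:
            if last is None:
                first = pos
            else:
                slots[last] = pos - last - 1                # chain link in the unused slot
            exceptions.append(value)
            last = pos
    if last is not None:
        slots[last] = 0              # end of chain; the decoder stops by exception count
    return first, slots, exceptions


def pfor_decode(first, slots, exceptions):
    """Follow the chain and patch the exception values back in."""
    out = list(slots)
    pos = first
    for value in exceptions:
        nxt = pos + slots[pos] + 1
        out[pos] = value
        pos = nxt
    return out


block = [1, 2, 3, 100, 2, 1, 0, 3, 2, 1, 200, 1]   # hypothetical d-gaps, B = 12
first, slots, exceptions = pfor_encode(block, k=2)
restored = pfor_decode(first, slots, exceptions)
```

With k=2, the real exceptions at positions 3 and 10 are seven positions apart, so position 7 becomes a forced exception even though its value 3 would fit in two bits.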

DocID Assignment Problem
- The previous methods all compressed d-gaps; in all cases, small d-gaps are encoded by fewer bits than large ones
- Can documents be ordered (i.e. can document identifiers be assigned) such that the implied d-gaps are smaller and will thus compress better?

[Figure: documents c1-c8 in their original order, and reordered as c6, c3, c8, c1, c2, c5, c7, c4]

DocID Assignment Problem (cont.)
- When framed as an optimization (minimization) problem of finding the best permutation of a set of documents, docid assignment is NP-hard
- For small d-gaps - smaller than the expected N/df(t) - documents with similar terms should be assigned close docids
  - Techniques applied: clustering, TSP approximations
- Observation: if a d-gap cannot be made smaller than N/df(t), try making it as large as possible (why?)
- N: number of documents; df(t): document frequency of term t
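The effect of docid assignment on compressed size can be seen on a toy collection. In the sketch below the documents, their terms and the two orderings are invented for illustration, and the cost measure (total γ-coded length of all d-gaps, ignoring frequencies and offsets) is just one convenient choice.

```python
def gamma_len(x):
    return 2 * (x.bit_length() - 1) + 1


def index_cost(docs, order):
    """Total gamma-coded d-gap bits when docids 1..N follow the given document order."""
    doc_id = {name: i + 1 for i, name in enumerate(order)}
    postings = {}
    for name, terms in docs.items():
        for term in terms:
            postings.setdefault(term, []).append(doc_id[name])
    total = 0
    for ids in postings.values():
        prev = 0
        for x in sorted(ids):
            total += gamma_len(x - prev)   # the first "gap" is the docid itself
            prev = x
    return total


docs = {                                    # hypothetical four-document collection
    "c1": {"apple", "pie"},
    "c2": {"linux", "kernel"},
    "c3": {"apple", "pie"},
    "c4": {"linux", "kernel"},
}
interleaved = index_cost(docs, ["c1", "c2", "c3", "c4"])   # 20 bits
clustered = index_cost(docs, ["c1", "c3", "c2", "c4"])     # 12 bits
```

Placing documents with the same vocabulary next to each other turns gaps of 2 into gaps of 1, which is the intuition behind the clustering-based techniques mentioned above.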

DocID Assignment by URL Sorting
- State of the art on Web collections is surprisingly simple: ordering URLs lexicographically results in good compression (small d-gaps), as same-host pages that use similar vocabulary are grouped together [Silvestri, ECIR 2007]
  - Same topic, same page template, navigation bars, etc.
- Lexicographic URL sorting further preserves finer-grain site structure
  - However, most of the benefit of this method is gained from simply grouping same-host documents together
- Lexicographic URL sorting is also key to compressing the Web graph [Boldi & Vigna, WWW 2004]

Further Research
- The following areas have been researched:
  - Exploiting redundancy when indexing multiple documents with highly overlapping content:
    - Near-duplicate Web pages
    - Versioned documents (code, Wikipedia)
    - Threads (back-and-forth messages)
  - Document assignment problems on partitioned indexes
  - Achieving compact representations of the Web graph
    - In particular, the adjacency lists can also be d-gap encoded
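Docid assignment by URL sorting needs very little code, as the sketch below suggests. The URLs and the function name are made up for illustration, and plain string comparison is used as the lexicographic order; any URL normalization a real crawler would apply is left out.

```python
def assign_docids_by_url(urls):
    """Assign docids 1..N following the lexicographic order of the distinct URLs."""
    return {url: i + 1 for i, url in enumerate(sorted(set(urls)))}


urls = [                                   # hypothetical crawl order
    "http://b.example.org/x",
    "http://a.example.com/news/2",
    "http://a.example.com/about",
    "http://a.example.com/news/1",
]
doc_ids = assign_docids_by_url(urls)
```

Pages of the same host, and within a host the pages under the same path prefix, end up with consecutive docids, so terms shared across a site produce small d-gaps.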