
Introduction to Search Engine Technology: Index Compression
Ronny Lempel, Yahoo! Labs, Haifa
(Some of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa Research Lab)
23 November 2011, 236620 Search Engine Technology

Postings Lists - Reminder
- The lexicon entry corresponding to term t points to t's postings list and also holds t's DF.
- Logically, t's postings list is a list of posting elements corresponding to t's occurrences.
- Each posting element contains a document identifier along with the offsets of t's occurrences within the document.
- The list is sorted by increasing document identifiers.
- Formally, for a term appearing in n_t documents x_1, x_2, ..., x_{n_t}, the list is
  [(x_1, f_1, <o_1, ..., o_{f_1}>), (x_2, f_2, <o_1, ..., o_{f_2}>), ..., (x_{n_t}, f_{n_t}, <o_1, ..., o_{f_{n_t}}>)]
  where x_i < x_{i+1} and o_j < o_{j+1}.
- Efficient skipping mechanisms exist that enable reaching a position in a postings list without streaming through its prefix.

Compression of Postings Lists
- Smaller, more compact postings lists mean less I/O! Or that larger indices can fit in RAM.
- Key idea: since the doc-ids associated with each term t are in ascending order, encode each doc-id by its gap from the previous identifier. This encoding is called d-gap encoding.
- Example: transform the list
  [(x_1, f_1, <o_1, ..., o_{f_1}>), (x_2, f_2, <o_1, ..., o_{f_2}>), ..., (x_{n_t}, f_{n_t}, <o_1, ..., o_{f_{n_t}}>)]
  into
  [(x_1, f_1, <o_1, ..., o_{f_1}>), (x_2 - x_1, f_2, <o_1, ..., o_{f_2}>), ..., (x_{n_t} - x_{n_t-1}, f_{n_t}, <o_1, ..., o_{f_{n_t}}>)]
- Note: the sequences of occurrence offsets within documents can also be encoded in a similar fashion.

Why Use d-gaps?
- No information loss, but where is the saving? The largest d-gap in the 2nd representation is potentially of the same order of magnitude as the largest document id in the 1st representation.
- If the index holds N documents and a fixed binary encoding is used, both methods require log(N) bits per doc-id/d-gap.
- However, frequent terms have d-gaps that are significantly smaller than the document identifiers in which they occur.
- Consequently: use variable-length encoding schemes, in which small and/or frequent d-gap values are encoded in fewer than log(N) bits.
- The optimal choice of encoding scheme depends on the probability distribution of the d-gaps and on decoding speeds.
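The d-gap transformation is a single pass over the sorted doc-id list; a minimal sketch in Python (the function names are mine, not from the slides):

```python
def to_dgaps(docids):
    """Encode a sorted doc-id list as gaps from the previous id."""
    return [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]

def from_dgaps(gaps):
    """Decode by running prefix sums; the exact inverse of to_dgaps."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

# frequent terms yield small gaps, which variable-length codes exploit
assert to_dgaps([3, 7, 8, 20]) == [3, 4, 1, 12]
assert from_dgaps([3, 4, 1, 12]) == [3, 7, 8, 20]
```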

Integer Representations: Fixed Length vs. Variable Length
- Fixed length: log(N) bits per integer, where N is the maximal possible integer.
  - Imposes a limit on the number of bits used per integer.
  - Does not compress: does not exploit the differences in the relative frequencies of the integers.
- Variable-length, prefix-free encoding allows unbounded number representation with significant space savings.
  - Single-gap variable-length representations: Vint, Huffman, unary, γ, δ, Golomb encodings.
  - Multiple-gap variable-length representations: Group Varint, Simple9, PForDelta. Simple9 doesn't support unbounded numbers, but is practical enough.
- How much space savings is possible?

First Example - Vint
- A byte-aligned family of schemes, with each integer encoded by a variable number of bytes.
- Simplest form, chaining: the leading bit of each byte indicates whether the number continues in an additional byte.
  E.g., the 10-bit number 1101001011 becomes the two bytes 10000110 01001011.
- Alternatively, if the maximal integer is bounded, the leading bits of the leading byte can encode the number of additional bytes needed. Example for integers bounded by 2^30 - 1:
  - Integers up to 2^6 - 1:  00xxxxxx
  - Integers up to 2^14 - 1: 01xxxxxx xxxxxxxx
  - Integers up to 2^22 - 1: 10xxxxxx xxxxxxxx xxxxxxxx
  - Integers up to 2^30 - 1: 11xxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
- Other variants exist, all easily decodable.
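The chained form of Vint can be sketched as follows, using the slide's convention that a set leading bit means "the number continues in the next byte" (emitting the 7-bit groups most-significant-first, which reproduces the slide's two-byte example):

```python
def vint_encode(n):
    """Chained Vint: split n into 7-bit groups, most significant first;
    set the leading bit on every byte except the last."""
    groups = []
    while True:
        groups.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    groups.reverse()
    return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])

def vint_decode(buf):
    """Decode a concatenation of chained Vints into a list of integers."""
    out, cur = [], 0
    for b in buf:
        cur = (cur << 7) | (b & 0x7F)
        if not (b & 0x80):          # leading bit 0: last byte of this number
            out.append(cur)
            cur = 0
    return out

# the slide's 10-bit example, 1101001011, encodes to 10000110 01001011
assert vint_encode(0b1101001011) == bytes([0b10000110, 0b01001011])
assert vint_decode(vint_encode(5) + vint_encode(843)) == [5, 843]
```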

Group Varint
- Used by Google [Jeff Dean, keynote at WSDM 2009].
- Encodes 4 integers in blocks of 5-17 bytes.
- First byte: four 2-bit binary length fields L_1 L_2 L_3 L_4, with L_j ∈ {1, 2, 3, 4}.
- Then L_1 + L_2 + L_3 + L_4 bytes (between 4 and 16) holding the 4 numbers; each number can use 8/16/24/32 bits.
- Reported to be about twice as fast to decode as (single) Vint schemes.

Prefix-Free Coding of Integers
- Let N be the set of natural numbers and Σ the alphabet of the code; in our case, Σ = {0,1}.
- C: N → Σ⁺ is a prefix-free code if for any two distinct natural numbers i, j, C(i) is not a prefix of C(j).
- Significance of prefix-free coding: codewords can be concatenated to each other without the need for any delimiters, and the resulting sequence remains uniquely decipherable.
- Sometimes called comma-free codes.
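A sketch of one Group Varint block under the layout described above; the little-endian byte order within each number is my assumption, as the slide doesn't fix a byte order:

```python
def gvarint_encode(nums):
    """One Group Varint block: a tag byte holding four 2-bit (length-1)
    fields, followed by 4-16 payload bytes."""
    assert len(nums) == 4
    tag, body = 0, b""
    for x in nums:
        length = max(1, (x.bit_length() + 7) // 8)    # 1..4 bytes
        tag = (tag << 2) | (length - 1)
        body += x.to_bytes(length, "little")
    return bytes([tag]) + body

def gvarint_decode(buf):
    """Decode one 4-integer block produced by gvarint_encode."""
    tag, pos, out = buf[0], 1, []
    for shift in (6, 4, 2, 0):
        length = ((tag >> shift) & 0b11) + 1
        out.append(int.from_bytes(buf[pos:pos + length], "little"))
        pos += length
    return out

block = gvarint_encode([7, 300, 70000, 2**31])
assert 5 <= len(block) <= 17                          # the slide's block range
assert gvarint_decode(block) == [7, 300, 70000, 2**31]
```

Decoding reads the single tag byte and then slices fixed-width fields, which is why this layout avoids the per-byte branch of chained Vint.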

Shannon's Lossless Source Coding Theorem (simplified version)
- Let C: N → {0,1}⁺ be a binary encoding of the set of natural numbers, and let P: N → [0,1] be a probability distribution on the set of natural numbers.
- Denote by b_i the number of bits in C(i), i.e. the length of i's encoding.
- The expected length (in bits) of a codeword of C is thus R(C) = Σ_{i>0} p(i) b_i. (R(C) is also called the rate of the code C.)
- Shannon:
  1. For any code C, R(C) ≥ -Σ_{i>0} p(i) log2[p(i)]
  2. An optimal code C* will achieve R(C*) < 1 - Σ_{i>0} p(i) log2[p(i)]
- The quantity -Σ_{i>0} p(i) log2[p(i)] is called the entropy of P and is denoted by H(P).

Unary Representation
- Perhaps the simplest prefix-free variable-length representation of positive integers: x maps to (x-1) 1s followed by a single 0.
  E.g., 1 → 0, 3 → 110, 5 → 11110.
- By Shannon, optimal for the distribution Pr(x) = 2^-x, since R(unary) = H(Pr(x) = 2^-x).
- The total length of the gap representations in a postings list equals the ordinal number of the last document that includes the term.
- Beats fixed-length encodings for terms that appear in more than N/log(N) documents.
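The unary code, plus a numerical check of the optimality claim above: under Pr(x) = 2^-x the code length of x is exactly -log2 Pr(x) = x bits, so the rate equals the entropy (both sums converge to 2):

```python
import math

def unary_encode(x):
    """x >= 1 maps to (x-1) ones followed by a single zero."""
    return "1" * (x - 1) + "0"

# the slide's examples
assert unary_encode(1) == "0" and unary_encode(3) == "110" and unary_encode(5) == "11110"

# Rate R(C) = sum of p(x) * len(C(x)); entropy H(P) = -sum of p(x) log2 p(x).
# The sums are truncated at 60 terms, which is far past double precision.
rate = sum(2**-x * len(unary_encode(x)) for x in range(1, 60))
entropy = -sum(2**-x * math.log2(2**-x) for x in range(1, 60))
assert abs(rate - entropy) < 1e-12 and abs(rate - 2.0) < 1e-9
```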

γ (Gamma) Coding
- Factor any x > 0 into 2^e + d, where e = floor(log2 x) and 0 ≤ d < 2^e.
- Represent e+1 in unary; represent d in binary, using e bits.
  E.g., 9 = 2^3 + 1 → 1110:001
- Representation length: 2*floor(log2 x) + 1.
- Optimal for 1/(2x^2) < Pr(x) ≤ 1/x^2. (Is 1/x^2 OK as a probability distribution? Up to normalization, yes: Σ_x 1/x^2 converges.)

  Gamma Code    Integers   Bits
  0             1          1
  10x           2-3        3
  110xx         4-7        5
  1110xxx       8-15       7
  11110xxxx     16-31      9
  111110xxxxx   32-63      11

Generalization: the δ (Delta) Code
- Factor x > 0 into 2^e + d, where e = floor(log2 x) and 0 ≤ d < 2^e.
- Represent e+1 in γ; represent d in binary, using e bits.
- Detailed example, the δ encoding of 9:
  - 9 = 2^3 + 1, i.e. e = 3 and d = 1
  - e+1 = 4 in gamma is 110:00
  - d = 1 in 3-bit representation is 001
  - Altogether: 110:00:001
- Length ≈ 1 + floor(log2 x) + 2*floor(log2 log2 2x)
- Optimal for P(x) ≈ 1/(2x (log2 2x)^2)

  Delta Code    Integers   Bits
  0             1          1
  100x          2-3        4
  101xx         4-7        5
  11000xxx      8-15       8
  11001xxxx     16-31      9
  11010xxxxx    32-63      10
  11011xxxxxx   64-127     11
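Both codes fit in a few lines; the asserts reproduce the slide's worked example for 9 and the 1,000,000 entries of the comparison table that follows:

```python
def gamma_encode(x):
    """Gamma: e+1 in unary, then d in e binary bits, where x = 2^e + d."""
    e = x.bit_length() - 1                  # e = floor(log2 x)
    d = x - (1 << e)
    return "1" * e + "0" + (format(d, "b").zfill(e) if e else "")

def delta_encode(x):
    """Delta: e+1 in gamma instead of unary, then d in e binary bits."""
    e = x.bit_length() - 1
    d = x - (1 << e)
    return gamma_encode(e + 1) + (format(d, "b").zfill(e) if e else "")

assert gamma_encode(9) == "1110" + "001"    # the slide's example, 9 = 2^3 + 1
assert delta_encode(9) == "11000" + "001"
assert len(gamma_encode(1_000_000)) == 39   # rows of the comparison table
assert len(delta_encode(1_000_000)) == 28
```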

Encoding Lengths Comparison

  Number     Unary      Gamma  Delta
  1          1          1      1
  2          2          3      4
  8          8          7      8
  1,000,000  1,000,000  39     28

- A fixed-length representation would require at least 20 bits per integer to encode a range of 1M.

Golomb-Rice Codes
- Golomb codes are a parametric family of prefix codes that are very easy to implement.
- They are distinguished from each other by a single parameter m; the optimal choice of m depends on the probability distribution of the input sequence.
- Rice coding is a special case of Golomb coding with m being a power of 2; operations can then be done by masking and shifting bits.

Golomb-Rice Coding (cont.)
- To encode an integer n using the Golomb code with parameter m = 2^k:
  - Write n as r*m + d, where r = floor(n/m) is the quotient and 0 ≤ d < m is the remainder.
  - Represent r+1 in unary (since 0 doesn't have a unary encoding).
  - Represent the remainder (n mod m) in binary using k bits.

  Integer  m=4      # bits  m=8     # bits
  0-3      0xx      3       0xxx    4
  4-7      10xx     4       0xxx    4
  8-11     110xx    5       10xxx   5
  12-15    1110xx   6       10xxx   5
  16-19    11110xx  7       110xxx  6

Matching Code to Distribution
- Unary coding is optimal when Pr(x) = 2^-x.
- Gamma is optimal when Pr(x) ≈ 1/(2x^2).
- Delta is optimal when Pr(x) ≈ 1/(2x (log 2x)^2).
- Golomb-Rice is optimal when Pr(x) = (1-p)^{x-1} p, i.e. for geometric distributions, provided that m is chosen such that
  (1-p)^m + (1-p)^{m+1} ≤ 1 < (1-p)^m + (1-p)^{m-1}
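A sketch of Rice coding (Golomb with m = 2^k), checked against rows of the table above:

```python
def rice_encode(n, k):
    """Golomb-Rice with m = 2^k: r+1 in unary (r = n >> k), then the
    remainder n mod m in k binary bits. Handles n >= 0, as in the table."""
    r, d = n >> k, n & ((1 << k) - 1)       # quotient and remainder by shift/mask
    return "1" * r + "0" + format(d, "b").zfill(k)

assert rice_encode(5, 2) == "1001"          # m=4, row 4-7:   10xx
assert rice_encode(13, 3) == "10101"        # m=8, row 8-15:  10xxx
assert rice_encode(17, 2) == "1111001"      # m=4, row 16-19: 11110xx
```

The shift-and-mask quotient/remainder is exactly the implementation convenience the slide attributes to powers of 2.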

Golomb-Rice vs. the Rest
- For a word that appears in a fraction p of the documents, consider each document to have received the word independently with probability p (i.e., a Bernoulli process). This is an approximation, but a reasonable one in most cases.
- Consequently, the d-gaps are distributed geometrically with parameter p, and Golomb-Rice encoding is optimal.
- Different parameters can be used for different postings lists, based on the DF of each term that is stored in the lexicon.

Simple9 Encoding Scheme [Anh & Moffat, 2004]
- A word-aligned, multiple-number encoding scheme.
- Encoding block: 4 bytes (32 bits). The most significant nibble (4 bits) describes the layout of the 28 other bits, as n numbers of b bits each with n*b ≤ 28:
  - 0: a single 28-bit number
  - 1: two 14-bit numbers
  - 2: three 9-bit numbers (and one spare bit)
  - 3: four 7-bit numbers
  - 4: five 5-bit numbers (and three spare bits)
  - 5: seven 4-bit numbers
  - 6: nine 3-bit numbers (and one spare bit)
  - 7: fourteen 2-bit numbers
  - 8: twenty-eight 1-bit numbers
- Simple16 is a variant that defines 5 additional (uneven) configurations.
- Can be efficiently decoded using bit masks.
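A sketch of Simple9 packing under the selector table above; the greedy choice of the densest layout that fits, and the high-to-low field order within the word, are my assumptions rather than details from the slide:

```python
# selector -> (count n, width b), with n * b <= 28, matching the table above
SIMPLE9 = [(1, 28), (2, 14), (3, 9), (4, 7), (5, 5), (7, 4), (9, 3), (14, 2), (28, 1)]

def simple9_pack(nums):
    """Pack a prefix of nums into one 32-bit word; returns (word, count)."""
    for sel in range(8, -1, -1):            # try the densest layout first
        n, b = SIMPLE9[sel]
        if len(nums) >= n and all(0 <= x < (1 << b) for x in nums[:n]):
            word = sel << 28                # selector in the top nibble
            for i, x in enumerate(nums[:n]):
                word |= x << (28 - (i + 1) * b)   # fields laid high-to-low
            return word, n
    raise ValueError("a value needs more than 28 bits")

def simple9_unpack(word):
    """Recover the packed integers from one 32-bit word via masks/shifts."""
    n, b = SIMPLE9[word >> 28]
    return [(word >> (28 - (i + 1) * b)) & ((1 << b) - 1) for i in range(n)]

word, n = simple9_pack([100, 2, 5, 1])      # all fit in 7 bits -> selector 3
assert n == 4 and simple9_unpack(word) == [100, 2, 5, 1]
```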

PForDelta [S. Heman, 2005]
- Encode a block of B integers together (e.g., B = 128).
- Determine a percentage threshold x such that x% of the B integers fit in k bits (e.g., x = 90).
- Allocate an array of kB bits, and write any integer that fits in k bits into its corresponding slot; the minority of integers that don't fit in k bits are called exceptions.
- Encode the locations of the exceptions by chaining, using their unused k-bit slots in the array:
  - The index of the first exception is encoded before the array, in log B bits.
  - The gap to the next exception is encoded in k bits; if it doesn't fit in k bits, force an additional exception.
- Encode the exceptions somehow after the log B + kB bits.

Practical Considerations
- Most search engines are believed to be using byte-aligned compression schemes. While this favors Vint/Group Varint and Simple9, one can also byte-align any of the other methods by adding padding zeros.
- When using d-gap compression on postings lists that support efficient skipping, each possible landing point of a skip (e.g., each block in a B+ tree) must start with an absolute doc-id rather than a d-gap from the previous postings element.
- Offsets (locations) within documents are also encoded.
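A simplified sketch of the PForDelta idea: it picks k from the x-percentile of bit widths, but stores the exceptions as explicit (position, value) pairs instead of the slide's in-array chaining, purely to keep the example short:

```python
def pfor_encode(block, x=0.9):
    """Pick k so that roughly a fraction x of the block fits in k bits;
    values that don't fit become exceptions (kept out of band here,
    a simplification of the real in-array chaining)."""
    widths = sorted(v.bit_length() for v in block)
    k = max(1, widths[max(0, int(x * len(block)) - 1)])
    slots = [v if v.bit_length() <= k else 0 for v in block]
    exceptions = [(i, v) for i, v in enumerate(block) if v.bit_length() > k]
    return k, slots, exceptions

def pfor_decode(k, slots, exceptions):
    """Patch the exception values back into their slots."""
    out = list(slots)
    for i, v in exceptions:
        out[i] = v
    return out

block = [3, 1, 4, 1, 5, 9, 2, 6, 500, 3]
k, slots, exc = pfor_encode(block)
assert k == 4 and exc == [(8, 500)]         # 500 is the lone exception
assert pfor_decode(k, slots, exc) == block
```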

DocID Assignment Problem
- The previous methods all compressed d-gaps; in all cases, small d-gaps are encoded by fewer bits than large ones.
- Can documents be ordered (i.e., can document identifiers be assigned) such that the implied d-gaps are smaller and will thus compress better?
- Example, a term-document incidence matrix before and after reordering the columns (documents):

    c1 c2 c3 c4 c5 c6 c7 c8
    1  0  1  0  0  1  0  1
    0  0  0  1  0  0  1  0
    1  1  1  0  0  0  0  1
    0  1  0  0  1  0  1  0

    c6 c3 c8 c1 c2 c5 c7 c4
    1  1  1  1  0  0  0  0
    0  0  0  0  0  0  1  1
    0  1  1  1  1  0  0  0
    0  0  0  0  1  1  1  0

DocID Assignment Problem (cont.)
- When framed as an optimization (minimization) problem of finding the best permutation of a set of documents, doc-id assignment is NP-hard.
- For small d-gaps, smaller than the expected N/df(t), documents with similar terms should be assigned close doc-ids. Techniques applied: clustering, TSP approximations.
- Observation: if a d-gap cannot be made smaller than N/df(t), try making it as large as possible (why?)
- Here N = number of documents; df(t) = document frequency of term t.
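The effect can be checked directly on the slide's toy matrix: gamma-coding the d-gaps of each row's postings list costs fewer total bits under the reordered column assignment. The orderings and the gamma length formula come from the earlier slides; the cost function itself is mine:

```python
# the slide's term-document incidence matrix, columns c1..c8
rows = ["10100101", "00010010", "11100001", "01001010"]
identity = [1, 2, 3, 4, 5, 6, 7, 8]
reordered = [6, 3, 8, 1, 2, 5, 7, 4]        # the slide's column permutation

def gamma_bits(g):
    return 2 * (g.bit_length() - 1) + 1     # gamma code length of gap g

def index_cost(order):
    """Total gamma-coded d-gap bits over all four postings lists, where
    column c_j is assigned doc-id = its position in `order`."""
    total = 0
    for row in rows:
        docids = sorted(order.index(c + 1) + 1
                        for c in range(8) if row[c] == "1")
        gaps = [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]
        total += sum(gamma_bits(g) for g in gaps)
    return total

assert index_cost(reordered) < index_cost(identity)   # 23 vs. 35 bits
```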

DocID Assignment by URL Sorting
- The state of the art on Web collections is surprisingly simple: ordering URLs lexicographically results in good compression (small d-gaps), as same-host pages that use similar vocabulary are grouped together [Silvestri, ECIR 2007].
  - Same topic, same page template, navigation bars, etc.
- Lexicographic URL sorting further preserves finer-grain site structure; however, most of the benefit of this method is gained from simply grouping same-host documents together.
- Lexicographic URL sorting is also key to compressing the Web graph [Boldi & Vigna, WWW 2004].

Further Research
The following areas have been researched:
- Exploiting redundancy when indexing multiple documents with highly overlapping content:
  - Near-duplicate Web pages
  - Versioned documents (code, Wikipedia)
  - Email threads (back-and-forth messages)
- Document assignment problems on partitioned indexes
- Achieving compact representations of the Web graph; in particular, the adjacency lists can also be d-gap encoded