Partitioned Elias-Fano Indexes




Giuseppe Ottaviano
ISTI-CNR, Pisa
giuseppe.ottaviano@isti.cnr.it

Rossano Venturini
Dept. of Computer Science, University of Pisa
rossano@di.unipi.it

ABSTRACT
The Elias-Fano representation of monotone sequences has been recently applied to the compression of inverted indexes, showing excellent query performance thanks to its efficient random access and search operations. While its space occupancy is competitive with some state-of-the-art methods such as γ-δ-Golomb codes and PForDelta, it fails to exploit the local clustering that inverted lists usually exhibit, namely the presence of long subsequences of close identifiers. In this paper we describe a new representation based on partitioning the list into chunks and encoding both the chunks and their endpoints with Elias-Fano, hence forming a two-level data structure. This partitioning enables the encoding to better adapt to the local statistics of the chunk, thus exploiting clustering and improving compression. We present two partition strategies, respectively with fixed and variable-length chunks. For the latter case we introduce a linear-time optimization algorithm which identifies the minimum-space partition up to an arbitrarily small approximation factor. We show that our partitioned Elias-Fano indexes offer significantly better compression than plain Elias-Fano, while preserving their query time efficiency. Furthermore, compared with other state-of-the-art compressed encodings, our indexes exhibit the best compression ratio/query time trade-off.

Categories and Subject Descriptors
H.3.2 [Information Storage and Retrieval]: Information Storage; E.4 [Coding and Information Theory]: Data Compaction and Compression

Keywords
Compression; Dynamic Programming; Inverted Indexes

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. SIGIR'14, July 6-11, 2014, Gold Coast, Queensland, Australia. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-2257-7/14/07...$15.00. http://dx.doi.org/10.1145/2600428.2609615.

1. INTRODUCTION
The inverted index is the data structure at the core of most large-scale search systems for text, (semi-)structured data, and graphs, with web search engines, XML and RDF databases, and graph search engines in social networks as the most notable examples. The huge size of the corpora involved and the stringent query efficiency requirements of these applications have driven a large amount of research with the ultimate goal of minimizing the space occupancy of the index and maximizing its query processing speed. These are two conflicting objectives: a high level of compression is obtained by removing the redundancy in the dataset, whereas high speed is obtained by keeping the data easily accessible and by augmenting the dataset with auxiliary information that drives the query processing algorithm. The effects of space reduction on the memory hierarchy partially mitigate this dichotomy. Indeed, memory transfers are an important bottleneck in query processing, and, thus, fitting more data into higher and faster levels of the hierarchy reduces the transfer time from the slow levels, and, hence, speeds up the algorithm [5].
In fact, in the last few years the focus has shifted from disk-based indexes to in-memory indexes, as in many scenarios it is not possible to afford a single disk seek. However, these beneficial effects can be nullified by slow decoding algorithms. Thus, research has focused its attention on designing solutions that best balance decoding time and space occupancy.

In its most basic and popular form, an inverted index is a collection of sorted sequences of integers [6, 16, 27]. Compressing such sequences is a crucial problem which has been studied since the 1950s; the literature presents several approaches, each of which introduces its own trade-off between space occupancy and decompression speed [15, 17, 19, 22, 26]. Most of these methods only allow sequential decoding, so the lists involved while processing a query need to be entirely decoded. This can be avoided by using the standard technique of splitting each sequence into blocks of fixed size (say, 128 elements) and encoding each block independently, so that it is possible to avoid the decompression of portions which are not necessary for answering the query. Still, a block must be fully decoded even if just one of its values is required.

Recently Vigna [23] overcame this drawback by introducing a new data structure called quasi-succinct index. This index hinges on the Elias-Fano representation of monotone sequences [10, 11], a conceptually simple and elegant data structure that supports fast random access and search operations, combining strong theoretical guarantees and excellent practical performance. Vigna showed that quasi-succinct indexes can compete in speed with mature and highly engineered implementations of state-of-the-art inverted indexes. In particular, Elias-Fano shines when the values accessed are scattered in the list; this is very common in conjunctive queries when the number of results is significantly smaller than the sequences being intersected. A typical scenario, for example, is intersecting edge sets in large social network graphs, as in Facebook's Graph Search, which has in fact adopted an implementation of Elias-Fano indexes [8].

Experiments show however that Elias-Fano indexes can be significantly larger than those obtained using state-of-the-art encoders. This inefficiency is caused by its inability to exploit any characteristics of the sequence other than two global statistics: the number of elements in the list, and the value of the largest element. More precisely, Elias-Fano represents a monotone sequence of m integers smaller than u with roughly $m \lceil \log(u/m) \rceil + 2m$ bits of space, regardless of any regularities in the sequence. As an extreme toy example consider the sequence $S = [0, 1, 2, \ldots, m - 2, u - 1]$ of length m. This sequence is highly compressible since the length of the first run and the value of u, which can be encoded in $O(\log u)$ bits, are sufficient to describe S. Elias-Fano requires instead $\lceil \log(u/m) \rceil + 2$ bits for every element in the sequence, which is not one single bit less than it would require to represent a random sequence of m sorted elements smaller than u. Even if this is just a toy example, it highlights an issue that occurs frequently in compressing posting lists, whose compressibility is caused by large clusters of very close values.

In this paper we tackle this issue by partitioning the sequences into contiguous chunks and encoding each chunk independently with Elias-Fano, so that the encoding of each chunk can better adapt to the local statistics. To perform random access and search operations, the lists of chunk endpoints and boundary elements are in turn encoded with Elias-Fano, thus forming a two-level data structure which supports fast queries on the original sequence. The chunk endpoints can be completely arbitrary, so we propose two strategies to define the partition.
The first is a straightforward uniform partition, meaning that each chunk (except possibly the last) has the same fixed size. The second aims at minimizing the space occupancy by setting up the partitioning as an instance of an optimization problem, for which we present a linear-time algorithm that is guaranteed to find a solution at most $(1 + \epsilon)$ times larger than the optimal one, for any given $\epsilon \in (0, 1)$.

We perform an extensive experimental analysis on two large text corpora, namely Gov2 and ClueWeb09 (Category B), with several query operations, and compare our indexes with the plain quasi-succinct indexes as described in [23], and with three state-of-the-art list encodings, namely Binary Interpolative Coding [17], the PForDelta variant OptPFD [26], and Varint-G8IU [22], a SIMD-optimized Variable Byte code. Respectively, they are representative of best compression ratio, best compression ratio/processing speed trade-off, and highest speed in the literature. In our experiments, we show that our indexes are significantly smaller than the original quasi-succinct indexes, with a small query time overhead. Compared to the others, they are slightly larger but significantly faster than Interpolative, both faster and smaller than OptPFD, and slightly slower but significantly smaller than Varint-G8IU.

Our contributions. We list here our main contributions.

1. We introduce a two-level representation of monotone sequences which, given a partition of a sequence into chunks, represents each chunk with Elias-Fano and stores the endpoints of the chunks and their boundary values in separate Elias-Fano sequences, in order to support fast random access and search operations.

2. We describe two partitioning strategies: a uniform strategy, which divides the sequence into chunks with a fixed size, and a strategy with variable-length chunks, whose endpoints are chosen by solving an optimization problem in order to minimize the overall space occupancy. More precisely, we introduce a linear-time dynamic programming algorithm which finds a partition whose cost is at most $(1 + \epsilon)$ times larger than the optimal one, for any given $\epsilon \in (0, 1)$. In the experiments we show that this ε-optimal strategy gives significantly smaller indexes than the uniform strategy, which in turn are significantly smaller than the indexes obtained with non-partitioned Elias-Fano. Indeed, the latter are more than 23% larger on one corpus and more than 64% larger on the other.

3. We show with an extensive experimental analysis that the partitioned indexes are only slightly slower than non-partitioned indexes. Furthermore, in comparison with other indexes from the literature, the ε-optimal indexes dominate, in both space and time, the methods which are in the same region of the space-time trade-off curve, and obtain spaces very close to the methods which give the highest compression ratio. More precisely, Binary Interpolative Coding is only 2%-8% smaller but up to 5.5 times slower; OptPFD is roughly 12% larger and almost always slower; Varint-G8IU is 10%-40% faster but more than 2.5 times larger.

2. BACKGROUND AND NOTATION
Given a collection D of documents, the posting list of a term t is the list of all the document identifiers, or docids, that contain the term t. The collection of posting lists for all the terms in the collection is called the inverted index of D; the set of such terms is usually called the dictionary. Posting lists are often augmented with additional information about each docid, such as the number of occurrences of the term in the document (the frequency), and the set of positions where the term occurs. Since the inverted index is a fundamental data structure in virtually all modern search systems, there is a vast amount of literature describing several variants and query processing strategies; we refer the reader to the excellent surveys and books on the subject [6, 16, 27].
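
The definitions above can be made concrete with a small sketch. It is only an illustration under simplifying assumptions: documents are plain strings, terms are whitespace-separated lowercase tokens, docids are list positions, and the helper name `build_inverted_index` is hypothetical rather than taken from the paper.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to (docids, freqs), both ordered by increasing docid.

    Assumption: a document is a string and a term is a lowercase token
    obtained by splitting on whitespace.
    """
    counts = defaultdict(dict)  # term -> {docid: frequency}
    for docid, text in enumerate(docs):
        for term in text.lower().split():
            counts[term][docid] = counts[term].get(docid, 0) + 1
    index = {}
    for term, per_doc in counts.items():
        docids = sorted(per_doc)
        index[term] = (docids, [per_doc[d] for d in docids])
    return index


docs = ["a rose is a rose", "a daisy", "rose garden"]
index = build_inverted_index(docs)
dictionary = sorted(index)      # the set of terms
rose_docids, rose_freqs = index["rose"]
```

Here `index["rose"]` holds the posting list of the term "rose" (docids 0 and 2) together with its per-document frequencies.
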
In the following, we adopt the docid-sorted variant, meaning that each posting list is sorted by docid; this enables fast query processing and efficient compression. In our experiments we focus our attention on posting lists storing docids and frequencies; we do not store positions or other additional data, since they have a different nature and often require specialized compression techniques [25], thus they are outside the scope of this paper. We also ignore additional per-document or per-term information, such as the mappings between docids and URLs, or between termids and actual terms, as their space is negligible compared to the index size.

Query Processing. Given a term query as a (multi-)set of terms, the basic query operations are the boolean conjunctive (AND) and disjunctive (OR) queries, retrieving the documents that contain respectively all the terms or at least one of them. In many scenarios the query-document pairs can be associated with a relevance score which is usually a function of the term frequencies in the query and in the document, and other global statistics. Instead of the full set of matches, for scored queries it is often sufficient to retrieve the k highest scored documents for a given k. A widely used relevance score is BM25 [18], which we will use in our experiments.

There exist two popular query processing strategies, dual in nature, namely Term-at-a-Time (TAAT) and Document-at-a-Time (DAAT). The former scans the posting list of each query term separately to build the result set, while the latter scans them concurrently, keeping them aligned by docid. We will focus on the DAAT strategy as it is the most natural for docid-sorted indexes.

The alignment of the lists during DAAT scanning can be achieved by means of the NextGEQ_t(d) operator, which returns the smallest docid in the list of t that is greater than or equal to d. A fast implementation of the function NextGEQ_t(d) is crucial for the efficiency of this process. The trivial implementation that sequentially scans the whole posting list is usually too slow; a common solution resorts to skipping strategies. The basic idea is to divide the lists into small blocks that are compressed independently, and to store additional information about each block, in particular the maximum docid present in the block. This makes it possible to find and decode only the block that may contain the sought docid by scanning the list of maxima, thus skipping a potentially large number of useless blocks. In the following we call block-based the indexes that adopt this technique.

Solving scored queries can be achieved with DAAT by computing the relevance score for the matching documents as they are found, and maintaining a priority queue with the top-k matches. This can be very inefficient for scored disjunctive queries, as the whole lists need to be scanned. Several query processing strategies have been introduced to alleviate this problem.
Among them, one of the most popular is WAND [3], which augments the index by storing for each term its maximum impact on the score, thus making it possible to skip large segments of docids if they only contain terms whose sum of maximum impacts is smaller than the scores of the top-k documents found up to that point. Again, WAND can be efficiently implemented in terms of NextGEQ_t(d).

3. RELATED WORK
Index compression ultimately reduces to the problem of representing sequences, specifically strictly monotone sequences for docids, and positive sequences for the frequencies. The two are equivalent: a strictly monotone sequence can be turned into a positive sequence by subtracting from each element the one that precedes it (also known as delta encoding), while the other direction can be achieved by computing prefix sums. For this reason most of the work assumes that the posting lists are delta-encoded and focuses on the representation of sequences of positive integers.

Representing such sequences of integers in compressed space is a crucial problem which has been studied since the 1950s with applications going beyond inverted indexes. A classical solution is to assign to each integer a uniquely-decodable variable-length code; if the codes do not depend on the input they are called universal codes. The most basic example is the unary code, which encodes a non-negative integer x as the bit sequence $0^x 1$. The unary code is efficient only if the input distribution is concentrated on very small integers. More sophisticated codes, such as Elias Gamma/Delta codes and Golomb/Rice codes, build on the unary code to efficiently compress a broader class of distributions. We refer to Salomon [19] for an in-depth discussion on this topic.

Bit-aligned codes can be inefficient to decode as they require several bitwise operations, so byte-aligned or word-aligned codes are usually preferred if speed is a main concern. Variable byte [19] or VByte is the most popular byte-aligned code.
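
As a brief aside, the delta-encoding equivalence and the unary code described earlier in this section can be sketched as follows. The function names are hypothetical, and the convention that the first gap is measured from -1 (so that a docid equal to 0 still yields a positive gap) is an assumption of this sketch, not something the text prescribes.

```python
from itertools import accumulate


def delta_encode(docids):
    """Turn strictly increasing non-negative docids into positive gaps.

    Assumption: the first gap is taken from -1, so every gap is >= 1.
    """
    gaps = []
    prev = -1
    for d in docids:
        gaps.append(d - prev)
        prev = d
    return gaps


def delta_decode(gaps):
    """Invert delta_encode by computing prefix sums."""
    return [s - 1 for s in accumulate(gaps)]


def unary(x):
    """Unary code of a non-negative integer: x zeros followed by a one."""
    return "0" * x + "1"


docids = [3, 4, 7, 13]
gaps = delta_encode(docids)        # -> [4, 1, 3, 6]
restored = delta_decode(gaps)      # -> [3, 4, 7, 13]
code = unary(3)                    # -> "0001"
```

Small gaps, which are typical of clustered posting lists, are exactly the values for which short codes such as unary pay off.
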
In VByte the binary representation of a non-negative integer x is split into groups of 7 bits which are represented as a sequence of bytes. The lower 7 bits of each byte store the data, whereas the eighth bit, called the continuation bit, is equal to 1 only for the last byte of the sequence. Stepanov et al. [22] present a variant of variable byte (called Varint-G8IU) which exploits SIMD operations of modern CPUs to further speed up decoding.

A different approach is to encode blocks of integers simultaneously in order to improve both compression ratio and decoding speed. This line of work has seen in the last few years a proliferation of encoding approaches which find their common roots in frame-of-reference (FOR) [14]. Their underlying idea is to partition the sequence of integers into blocks of fixed or variable length and to encode each block separately. The integers in each block are encoded by resorting to codewords of fixed length. A basic application of this technique (also called binary packing or packed binary [2, 15]) partitions the sequence of integers into blocks of b consecutive integers (e.g., b = 128 integers); for each block, the algorithm encodes the range enclosing the values in the block, say [l, r], then l is subtracted from each value, which is represented with $h = \lceil \log(r - l + 1) \rceil$ bits. There are several variants of this approach which differ in their encoding or partitioning strategies [9, 15, 21]. For example, Simple-9 and Simple-16 [1, 2, 26] are two popular variants of this approach.

A major space inefficiency of FOR is the fact that the presence of a few large values in the block forces the algorithm to encode all its integers with a large h, thus affecting the overall compression performance. To address this issue, PForDelta (PFD) [28] introduces the concept of patching. In PFD the value of h is chosen so that h bits are sufficient to encode a large fraction of the integers in the block, say 90%.
Those integers that do not fit within h bits are called exceptions and are encoded separately with a different encoder (e.g., Simple-9 or Simple-16). Yan et al. introduce the OptPFD variant [26], which selects for each block the value of h that minimizes the space occupancy. OptPFD is more compact and only slightly slower than the original PFD [15, 26].

A completely different approach is taken by Binary Interpolative Coding [17], which skips the delta-encoding step and directly encodes strictly monotone sequences. This method recursively splits the sequence of integers in two halves, encoding at each split the middle element and recursively the two halves. At each recursive step the range that encloses the middle element is reduced, and so is the number of bits needed to encode it. Experiments [21, 24, 26] have shown that Binary Interpolative Coding is the best encoding method for highly clustered sequences of integers. However, this space efficiency comes at the cost of a very slow decoding algorithm.

Recently, Vigna [23] introduced quasi-succinct indexes, based on the Elias-Fano representation of monotone sequences, showing that it is competitive with delta-encoded block-based indexes. Our paper aims at making this representation more space-efficient.

Optimal partitioning algorithms. The idea of partitioning a sequence to improve compression dates back to Buchsbaum et al. [4], who address the problem of partitioning the input of a compressor C so that compressing the chunks individually yields a smaller space than compressing the whole input at once. Their paper discusses only the case of compressing a large table of records with gzip but their solution can be adapted to solve the more general problem stated above. Their approach is to reduce this optimization problem to a dynamic programming recurrence which is solved in $\Theta(m^3)$ time and $\Theta(m^2)$ space, where m is the input size. Silvestri and Venturini [21] resort to a similar dynamic programming recurrence to optimize their encoder for posting lists. They obtain an $O(mh)$ construction time by limiting the length of the longest part to h. Ferragina et al. [12] significantly improve the result in [4] by computing in $O(m \log_{1+\epsilon} m)$ time and $O(m)$ space a partition whose compressed size is guaranteed to be at most $(1 + \epsilon)$ times the optimal one, for any given $\epsilon > 0$, provided that the compression ratio of C on any portion of the input can be estimated in constant time.

In this paper we apply the same ideas in [12] to the Elias-Fano representation and we exploit some of its properties to compute exactly, as opposed to estimating, its encoding cost in constant time. Then, we improve the optimization algorithm, reducing its time to $O(m)$ while preserving the same approximation guarantees.

4. SEARCHABLE SEQUENCES
The Elias-Fano representation [10, 11] is an elegant encoding for monotone sequences, which provides good compression ratio and efficient access and searching operations. Consider a monotonically increasing sequence $S[0, m - 1]$ of m non-negative integers (i.e., $S[i] \le S[i + 1]$, for any $0 \le i < m - 1$) drawn from a universe $[u] = \{0, 1, \ldots, u - 1\}$. Given an integer l, the elements of S are conceptually grouped into buckets according to their $\lceil \log u \rceil - l$ higher bits. The number of buckets is thus $\lceil u / 2^l \rceil$. The cardinalities of these buckets (including the empty ones) are written into a bitvector H with negated unary codes; it follows that H has length at most $m + \lceil u / 2^l \rceil$ bits.
The remaining l lower bits of each integer are concatenated into a bitvector L, which thus requires $ml$ bits. It is easy to see that H and L are sufficient to recover S[i] for every i: its $\lceil \log u \rceil - l$ higher bits are equal to the number of 0s preceding the ith 1 in H, and the l lower bits can be retrieved directly from L. While l can be chosen arbitrarily between 0 and $\lceil \log u \rceil$, it can be shown that the value $l = \lfloor \log(u/m) \rfloor$ minimizes the overall space. Summing up the lengths of H and L, it follows that the representation requires at most $m \lceil \log(u/m) \rceil + 2m$ bits.

Despite its simplicity, it is possible to support efficiently powerful operations on S. For our purposes we are interested in the following.

Access(i) which, given $i \in [m]$, returns S[i];

NextGEQ(x) which, given $x \in [u]$, returns the smallest element in S which is greater than or equal to x.

The support for these operations requires augmenting H with an auxiliary data structure to efficiently answer Select_0 and Select_1 operations, which, given an index i, return the position of respectively the ith 0 or the ith 1 in H. See [23] and references therein for more details on the implementation of these standard operations.

To access the ith element of S we have to retrieve and concatenate its higher and lower bits. The value of the higher bits is obtained by computing $\mathrm{Select}_1(i) - i$ in H, which represents the number of 0s (thus, buckets) ending before the ith occurrence of 1. The lower bits are directly accessed by reading l consecutive bits in L starting from position $il$.

The operation NextGEQ(x) is supported by observing that $p = \mathrm{Select}_0(h_x) - h_x$ is the number of elements of S whose higher bits are smaller than $h_x$, where $h_x$ are the higher bits of x. Thus, p is the starting position in S of the elements whose higher bits are equal to $h_x$ (if any) or larger. NextGEQ(x) is identified by scanning the elements starting from p. Several implementations of Elias-Fano that support these operations efficiently are available (see footnote 1).
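
The construction of H and L and the two operations can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the class name `EliasFano` and its methods are hypothetical, H and L are kept as plain Python lists, indices are 0-based, and Select is a linear scan instead of the constant-time auxiliary structure a real implementation would use.

```python
class EliasFano:
    """Toy Elias-Fano encoding of a non-decreasing sequence over [0, u)."""

    def __init__(self, values, u):
        m = len(values)
        assert m > 0 and all(0 <= v < u for v in values)
        assert all(a <= b for a, b in zip(values, values[1:]))
        self.m, self.u = m, u
        # l = floor(log2(u / m)) lower bits per element, clamped at 0.
        self.l = max(0, (u // m).bit_length() - 1)
        mask = (1 << self.l) - 1
        # L: the l lower bits of every element (kept as small ints here).
        self.low = [v & mask for v in values]
        # H: for each bucket, one 1 per element, then a terminating 0.
        n_buckets = ((u - 1) >> self.l) + 1  # ceil(u / 2^l)
        counts = [0] * n_buckets
        for v in values:
            counts[v >> self.l] += 1
        self.high = []
        for c in counts:
            self.high.extend([1] * c)
            self.high.append(0)

    def _select(self, bit, i):
        """Position of the i-th (0-based) occurrence of bit in H (naive scan)."""
        seen = -1
        for pos, b in enumerate(self.high):
            if b == bit:
                seen += 1
                if seen == i:
                    return pos
        raise IndexError(i)

    def access(self, i):
        """Return S[i]: higher bits from H, lower bits from L."""
        high = self._select(1, i) - i
        return (high << self.l) | self.low[i]

    def next_geq(self, x):
        """Smallest element >= x, or None if there is none."""
        if x >= self.u:
            return None
        hx = x >> self.l
        # Number of elements whose higher bits are smaller than hx.
        p = 0 if hx == 0 else self._select(0, hx - 1) - (hx - 1)
        for i in range(p, self.m):
            v = self.access(i)
            if v >= x:
                return v
        return None

    def bits(self):
        """Bits used by H and L together."""
        return len(self.high) + self.m * self.l


ef = EliasFano([3, 4, 7, 13, 14, 15, 21, 43], u=64)
```

With m = 8 and u = 64 the sketch picks l = 3, so `ef.bits()` is 16 + 24 = 40, which matches the bound of 8 * 3 + 2 * 8 bits; `ef.access(3)` returns 13 and `ef.next_geq(16)` returns 21.
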
All the operations can be implemented in a stateful cursor, in order to exploit locality of access by optimizing short forward skips, which are very frequent in query processing. An additional convenient cursor operation is Next, which advances the cursor from position i to i + 1.

Note that Elias-Fano requires just weak monotonicity of S, so if only the Access operation is needed, the space occupancy can be reduced to $m \lceil \log((u - m + 1)/m) \rceil + 2m$ bits by turning a strictly monotone S into the weakly monotone sequence $S'[i] = S[i] - i$ and encoding S' with Elias-Fano. At query time we can recover the ith entry of S by computing $\mathrm{Access}_{S'}(i) + i$.

The quasi-succinct indexes of Vigna [23] are a direct application of Elias-Fano; the posting lists are immediately representable with Elias-Fano as the docids are monotonically sorted. Since Elias-Fano is not efficient for dense sequences, a bitvector is used instead when it is convenient to do so; Access and NextGEQ can be supported efficiently with small additional data structures. The frequencies can be turned into a strictly monotone sequence by computing their prefix sums, and the ith frequency can be recovered as $\mathrm{Access}(i) - \mathrm{Access}(i - 1)$. Moreover, since the query processing needs only Access on the frequencies, we can use the trick mentioned above to reduce the space occupancy of the representation. In positional indexes, a similar transformation can be used to represent the term positions.

4.1 Uniform partitioning
As we argued above, Elias-Fano uses roughly $\lceil \log(u/m) \rceil + 2$ bits per element regardless of the sequence being compressed. Notice that $\log(u/m)$ is the logarithm of the average distance among consecutive elements in the sequence. Apart from this average distance, Elias-Fano does not exploit in any way the nature of the underlying sequence and, thus, for example, it does not make any distinction between the two extreme cases of a randomly generated sequence and a sequence formed by only a long run of consecutive integers.
While paying $\log(u/m)$ bits per element is the best we can hope for in the former case, we would expect to achieve a better space occupancy in the latter. Indeed, a sequence of integers is intuitively more compressible than a random one when it contains regions of integers which are very close to each other. The presence of these regions is typical in posting lists. Consider for example a term which is present only in a few domains. If the docids are assigned by sorting the documents by their URLs, then the posting list of this term is, apart from a few outliers, formed by clusters of integers in correspondence of those domains. Observe that, since the elements within each region are very close to each other, the average intra-region distance is much smaller than the global average distance.

Footnote 1: Such as the open-source http://github.com/facebook/folly, http://github.com/simongog/sdsl, http://github.com/ot/succinct, and http://sux.di.unimi.it.

The above observation is the main motivation for introducing a two-level Elias-Fano. The basic idea is to partition the sequence $S$ into $n/b$ chunks of $b$ consecutive integers each, except possibly the last one. The first level is an Elias-Fano representation of the sequence $L$ obtained by juxtaposing the last element of each chunk (i.e., $L = S[b - 1], S[2b - 1], \ldots, S[n - 1]$). The second level is the collection of the chunks of $S$, each represented with Elias-Fano.

The main advantage of having this first level is that the elements of the $j$th chunk can be rewritten in a smaller universe of size $u_j = L[j] - L[j - 1] - 1$ by subtracting $L[j - 1] + 1$ from each element. Thus, the Elias-Fano representation of the chunk requires $\lceil \log \frac{u_j}{b} \rceil + 2$ bits per element. Since the quantity $\frac{u_j}{b}$ is the average distance of the elements within the chunk, we expect that this space occupancy is much smaller than the one obtained by representing the sequence in its entirety, especially for highly compressible sequences. Observe that part of this gain vanishes due to the cost of the first level, which is $\lceil \log \frac{u}{n/b} \rceil + 2$ bits every $b$ original elements. Indeed, the first level stores $n/b$ integers drawn from a universe of size $u$.

This two-level representation introduces a level of indirection in solving the operations Access and NextGEQ. The operation Access($i$) is solved as follows. Let $j$ be the index of the chunk containing the $i$th element of $S$ (i.e., $j = \lfloor i/b \rfloor$) and $k$ be its offset within this chunk (i.e., $k = i \bmod b$). We access $L[j - 1]$ and $L[j]$ on the first level to compute the size of the universe $u_j$ of the chunk as $L[j] - L[j - 1]$ (or $L[j]$ if $j$ is the first chunk). Knowing $u_j$ and $b$ suffices for accessing the $k$th element of the $j$th chunk. If $e$ is the value at this position, then we conclude that the value $S[i]$ is equal to $L[j - 1] + 1 + e$, since $L[j - 1] + 1$ is the amount that was subtracted from the elements of the chunk.

The operation NextGEQ($x$) is solved as follows. We first compute the successor of $x$, say $L[j]$, on the first level. This implies that the successor of $x$ in $S$ is within the $j$th chunk and, thus, it can be identified by solving NextGEQ($x - L[j - 1] - 1$) on this chunk.

An important distinction with block-based indexes is that the choice of $b$ does not significantly affect the efficiency of the operations: while block-based indexes may need to scan the full block to retrieve one element, the chunks in our representation are searchable sequences themselves, so the performance does not degrade as $b$ gets larger; it actually gets better, as fewer block boundaries have to be crossed during a query.

In our implementation we use different encodings to overcome the space inefficiencies of Elias-Fano in representing dense chunks. The $j$th chunk is dense if the chunk covers a large fraction of the elements in the universe $[u_j]$ (or, in other words, $b$ is close to $u_j$). Indeed, the space bound $b \lceil \log \frac{u_j}{b} \rceil + 2b$ bits becomes close to $2u_j$ bits whenever $b$ approaches $u_j$. However, we can always represent the chunk within $u_j$ bits by writing the characteristic vector of the set of its elements as a bitvector. Note that Vigna [23] also uses this technique but only for whole lists, which are very unlikely to be so dense except for very few terms such as stop-words. In our case, instead, we expect dense chunks to occur frequently in representing posting lists, because they can be contained inside dense clusters of docids. Hence, besides Elias-Fano, we adopt two other encodings chosen depending on the relation between $u_j$ and $b$.

The first encoding addresses the extreme case in which the chunk covers the whole universe (i.e., whenever $u_j$ is equal to $b$). The first level gives us the values of $u_j$ and $b$, which are enough by themselves to derive all the elements in the chunk without the need of encoding further information. Operations Access and NextGEQ become trivial: both Access($i$) and NextGEQ($i$) are equal to $i$.
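The two-level Access and NextGEQ operations described above for uniform partitions can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: plain sorted lists with binary search (`bisect`) stand in for the Elias-Fano sequences of both levels, the class and method names are hypothetical, and the first chunk is rebased by 0 since it has no predecessor in $L$.

```python
from bisect import bisect_left

class UniformPartitioned:
    """Two-level layout with chunks of b elements; Python lists stand in
    for the Elias-Fano encoded sequences of the real data structure."""

    def __init__(self, S, b):
        self.b = b
        # First level: last element of each chunk.
        self.L = [S[min(s + b, len(S)) - 1] for s in range(0, len(S), b)]
        # Second level: chunks rewritten in their own smaller universe.
        self.chunks = []
        for j, s in enumerate(range(0, len(S), b)):
            base = self._base(j)
            self.chunks.append([x - base for x in S[s:s + b]])

    def _base(self, j):
        # Amount subtracted from chunk j: L[j-1] + 1, or 0 for the first chunk.
        return self.L[j - 1] + 1 if j > 0 else 0

    def access(self, i):
        j, k = divmod(i, self.b)          # chunk index and offset in the chunk
        return self._base(j) + self.chunks[j][k]

    def next_geq(self, x):
        j = bisect_left(self.L, x)        # first chunk whose last element is >= x
        if j == len(self.L):
            return None                   # no element of S is >= x
        base = self._base(j)
        k = bisect_left(self.chunks[j], x - base)
        return base + self.chunks[j][k]

S = [3, 4, 7, 13, 14, 15, 21, 43]
index = UniformPartitioned(S, b=3)
print(index.access(4), index.next_geq(8), index.next_geq(16))
```

Because every chunk is itself a searchable sequence, both operations cost one search on the first level plus one inside a single chunk, whatever the value of `b`.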
The second encoding is used whenever the size of the Elias-Fano representation of the chunk is larger than $u_j$ bits (i.e., whenever $b > \frac{u_j}{4}$). In this case we encode the set of elements in the chunk by writing its characteristic vector in $u_j$ bits. The Access and NextGEQ operations can be reduced to the standard Rank and Select operations on bitvectors; their implementation is described in detail in [23].

4.2 Optimal partitioning

Splitting $S$ into chunks of fixed size is likely to be suboptimal, since we cannot expect the dense clusters in $S$ to appear aligned with the uniform partition. Intuitively it would be better to allow $S$ to be partitioned freely, with chunks of variable size. It is not obvious however how to find the partition that minimizes the overall space occupancy: on one hand, the chunks should be as large as possible to minimize the number of entries in the first level of the representation and, thus, its space occupancy; on the other hand, the chunks should be as small as possible to minimize the average distances between their elements and, thus, the space occupancy of the second level of the representation.

An optimal partition can be computed in $\Theta(n^2)$ time and space by solving a variant of the dynamic programming recurrence introduced in [4]. However, these prohibitive complexities make this solution infeasible for inputs larger than a few thousand integers. This is the main motivation for designing an approximation algorithm which reduces the time and space complexities to linear at the cost of finding slightly suboptimal solutions. More precisely, in this subsection we present an algorithm that identifies in $O(n \log_{1+\epsilon} \frac{1}{\epsilon})$ time and linear space a partition whose cost is only $1 + \epsilon$ times larger than the optimal one, for any given $\epsilon \in (0, 1)$. Observe that the time complexity is linear as soon as $\epsilon$ is constant.

Before entering into the technical details of our solution, it is convenient to fix precisely the space costs involved in our representation.
The space occupancy of a given partition $P$ of $k$ chunks $S[i_0, i_1 - 1]\, S[i_1, i_2 - 1] \ldots S[i_{k-1}, i_k - 1]$, with $i_0 = 0$ and $i_k = n$, is $C(P) = \sum_{h=0}^{k-1} C(S[i_h, i_{h+1} - 1])$ bits, where $C(S[i, j])$ is the cost of representing the chunk $S[i, j]$. Each of these costs $C(S[i, j])$ is the sum of two terms: a fixed cost to store information regarding the chunk in the first level and the cost of representing its elements in the second level. Concerning the fixed cost, for each chunk we store three integers in the first level: the largest integer within the chunk, the size of the chunk, and the pointer to its second-level Elias-Fano representation. Thus, we can safely upper bound this cost with the quantity $F = 2\lceil \log u \rceil + \lceil \log n \rceil$ bits.

The cost of representing the elements in $S[i, j]$, instead, is computed by taking the minimum between the costs of the three possible encodings introduced in the previous subsection. Depending on the size of the universe $u' = S[j] - S[i - 1]$ (or $u' = S[j]$, if $i = 0$) and the number of elements $m' = j - i + 1$, these three costs are
i) $m'\ell + m' + \lfloor u'/2^{\ell} \rfloor$ bits with $\ell = \lfloor \log \frac{u'}{m'} \rfloor$, if $S[i, j]$ is encoded with Elias-Fano;
ii) $u'$ bits, if $S[i, j]$ is encoded with its characteristic vector;
iii) 0 bits, if $m' = u'$ and, thus, $S[i, j]$ covers the whole universe.

A crucial property to devise our approximation algorithm is the monotonicity of the cost function $C$, namely, for any $i$, $j$ and $k$ with $0 \le i < j < k < n$, we have $C(S[i, j]) \le C(S[i, k])$.

In the following we first use the algorithm in [12] to obtain a solution which finds a $(1 + \epsilon)$-approximation in time $O(n \log_{1+\epsilon} \frac{U}{F})$, where $U$ is the cost in bits of representing $S$ as a single partition. Then, we improve the algorithm to obtain a linear time solution with the same approximation guarantees.

Following [12], it is convenient to recast our optimization problem to the problem of finding a shortest path in a particular directed acyclic graph (DAG) $G$. Given the sequence $S$ of $n$ integers, the graph $G$ has a vertex $v_0, v_1, \ldots, v_{n-1}$ for each position of $S$ plus a dummy vertex $v_n$ marking the end of the sequence. The DAG $G$ is complete in the sense that, for any $i$ and $j$ with $i < j \le n$, there exists the edge from $v_i$ to $v_j$, denoted as $(v_i, v_j)$. Notice that there is a one-to-one correspondence between paths from $v_0$ to $v_n$ in $G$ and partitions of $S$. Indeed, a path $\pi = (v_0, v_{i_1})(v_{i_1}, v_{i_2}) \ldots (v_{i_{k-1}}, v_n)$ crossing $k$ edges corresponds to the partition $S[0, i_1 - 1]\, S[i_1, i_2 - 1] \ldots S[i_{k-1}, n - 1]$ of $k$ chunks. Hence, by assigning the weight $w(v_i, v_j) = C(S[i, j - 1])$ to each edge $(v_i, v_j)$, the weight of a path is equal to the cost in bits of the corresponding partition. Thus, a shortest path on $G$ corresponds to an optimal partition of $S$.

Computing a shortest path on a DAG has a time complexity proportional to the number of edges in the DAG. This is done with a classical elegant algorithm which processes the vertices from left to right [7]. The goal is to compute the value $M[v]$ for each vertex $v$, which is equal to the cost of a shortest path which starts at $v_0$ and ends at $v$. Initially, $M[v_0]$ is set to 0, while $M[v]$ is set to $+\infty$ for any other vertex $v$. When the algorithm reaches the vertex $v$ in its left-to-right scan, it assumes that $M[v]$ has been already correctly computed and extends this shortest path with any edge outgoing from $v$. This is done by visiting each edge $(v, v')$ outgoing from $v$ and by computing $M[v] + w(v, v')$.
If this quantity is smaller than $M[v']$, the path that follows the shortest path from $v_0$ to $v$ and then the edge $(v, v')$ is currently the best way to reach $v'$. Thus, $M[v']$ is updated to $M[v] + w(v, v')$. The correctness of this algorithm can be proved by induction and, since each edge is relaxed exactly once, its time complexity is proportional to the number of edges in the DAG.

Unfortunately our DAG $G$ is complete and, thus, it has $\Theta(n^2)$ edges, so this algorithm by itself does not suffice to obtain an efficient solution for our problem. However, it can be used as the last step of a solution which performs a non-trivial pruning of $G$. This pruning produces another DAG $G_\epsilon$ with two crucial properties: i) its number of edges is substantially reduced from $\Theta(n^2)$ to $O(n \log_{1+\epsilon} \frac{U}{F})$, for any given $\epsilon \in (0, 1)$; ii) its shortest path distance is (almost) preserved since it increases by no more than a factor $1 + \epsilon$.

The pruned graph $G_\epsilon$ is constructed as the subgraph of $G$ consisting of all the edges $(v_i, v_j)$ such that at least one of the following two conditions holds: i) there exists an integer $h \ge 0$ such that $w(v_i, v_j) \le F(1 + \epsilon)^h < w(v_i, v_{j+1})$; ii) $(v_i, v_j)$ is the last outgoing edge from $v_i$ (i.e., $j = n$). Since $w$ is monotone, these conditions correspond to keeping, for each integer $h$, the edge of $G$ that best approximates the value $F(1 + \epsilon)^h$ from below. The edges of $G_\epsilon$ are called $(1 + \epsilon)$-maximal edges. We point out that, since there exist at most $\log_{1+\epsilon} \frac{U}{F}$ possible values for $h$, each vertex of $G_\epsilon$ has at most $\log_{1+\epsilon} \frac{U}{F}$ outgoing $(1 + \epsilon)$-maximal edges. Thus, the total size of $G_\epsilon$ is $O(n \log_{1+\epsilon} \frac{U}{F})$. Theorem 3 in [12] proves that the shortest path distance on $G_\epsilon$ is at most $1 + \epsilon$ times larger than the one in $G$. Thus, given $G_\epsilon$, a $(1 + \epsilon)$-approximated partition can be computed in $O(n \log_{1+\epsilon} \frac{U}{F})$ time with the above algorithm.

We show now how to further reduce this time complexity to $O(n \log_{1+\epsilon} \frac{1}{\epsilon})$ time without altering the approximation guarantees.
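The cost function, the left-to-right shortest-path computation, and the $(1 + \epsilon)$-pruning described so far can be put together in a short Python sketch. It is a toy illustration rather than the authors' implementation: all function names are hypothetical, the fixed cost $F$ is computed from the upper bound given in the text with $u$ assumed to be the largest element plus one, and the maximal edges are found by brute force in quadratic time, so the sketch shows what the pruned graph contains, not how to build it efficiently.

```python
from math import ceil, floor, inf, log2

def chunk_cost(S, i, j, F):
    """Bits to store the chunk S[i..j] (inclusive): fixed cost F plus the
    cheapest of the three encodings (Elias-Fano, bitvector, implicit)."""
    u = S[j] - S[i - 1] if i > 0 else S[j]   # universe size of the chunk
    m = j - i + 1                            # number of elements in the chunk
    l = floor(log2(u / m)) if u > m else 0
    elias_fano = m * l + m + (u >> l)
    bitvector = u
    implicit = 0 if m == u else inf
    return F + min(elias_fano, bitvector, implicit)

def shortest_path(n, edges_from, w):
    """Left-to-right relaxation over v_0..v_n; returns (cost, chunk ends)."""
    M = [inf] * (n + 1)
    prev = [None] * (n + 1)
    M[0] = 0
    for i in range(n):
        for j in edges_from(i):
            if M[i] + w(i, j) < M[j]:
                M[j] = M[i] + w(i, j)
                prev[j] = i
    cuts, v = [], n
    while v != 0:                            # walk back from v_n to v_0
        cuts.append(v)
        v = prev[v]
    return M[n], cuts[::-1]

def maximal_edges(n, i, w, F, eps):
    """Targets of the (1+eps)-maximal edges leaving v_i (brute force)."""
    targets = {n}                            # condition ii): the last edge
    for j in range(i + 1, n):
        h = 0
        while F * (1 + eps) ** h < w(i, j):  # smallest h with w(i,j) <= F(1+eps)^h
            h += 1
        if F * (1 + eps) ** h < w(i, j + 1): # condition i)
            targets.add(j)
    return sorted(targets)

# Toy input: two dense runs of docids separated by a few outliers.
S = list(range(10, 20)) + [500, 900, 1500] + list(range(2000, 2010))
n = len(S)
F = 2 * ceil(log2(S[-1] + 1)) + ceil(log2(n))
w = lambda i, j: chunk_cost(S, i, j - 1, F)  # w(v_i, v_j) = C(S[i, j-1])

opt_cost, opt_cuts = shortest_path(n, lambda i: range(i + 1, n + 1), w)
apx_cost, apx_cuts = shortest_path(n, lambda i: maximal_edges(n, i, w, F, 0.3), w)
print(opt_cost, opt_cuts)
print(apx_cost, apx_cuts)
```

Since the pruned graph is a subgraph of the complete one and always keeps the edge to the last vertex, the approximate cost can never be lower than the optimal one, nor higher than the cost of storing the sequence as a single chunk.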
Let $\epsilon_1 \in (0, 1]$ and $\epsilon_2 \in (0, 1]$ be two parameters to be fixed later. We first obtain the graph $\bar{G}$ from $G$ by keeping only edges whose weight is no more than $L = F + \frac{2F}{\epsilon_1}$, plus the first edge outgoing from every vertex whose cost is larger than $L$. Then, we apply the pruning above to $\bar{G}$ by fixing the approximation parameter to $\epsilon_2$. The only difference here is that we force the pruning to retain the edges in $\bar{G}$ of cost larger than $L$. In this way we obtain a graph $\bar{G}_{\epsilon_2}$ having $O(n \log_{1+\epsilon_2} \frac{L}{F}) = O(n \log_{1+\epsilon_2} \frac{1}{\epsilon_1})$ edges. We can prove that the shortest path distance in $\bar{G}_{\epsilon_2}$ is at most $(1 + \epsilon_1)(1 + \epsilon_2)$ times larger than the one in $G$. This implies that the partition computed with $\bar{G}_{\epsilon_2}$ is a $(1 + \epsilon)$-approximation of the optimal one, by setting $\epsilon_1 = \epsilon_2 = \frac{\epsilon}{3}$ so that $(1 + \epsilon_1)(1 + \epsilon_2) \le 1 + \epsilon$. This result is stated in the following lemma.

Lemma 1. For any $\epsilon_1 > 0$ and $\epsilon_2 > 0$, the shortest path distance in $\bar{G}_{\epsilon_2}$ is at most $(1 + \epsilon_1)(1 + \epsilon_2)$ times larger than the one in $G$.

Proof. Let $\pi_G$, $\pi_{\bar{G}}$ and $\pi_{\bar{G}_{\epsilon_2}}$ denote the shortest paths in $G$, $\bar{G}$ and $\bar{G}_{\epsilon_2}$, respectively. We know that the weight $w(\pi_{\bar{G}_{\epsilon_2}})$ of $\pi_{\bar{G}_{\epsilon_2}}$ satisfies $w(\pi_{\bar{G}_{\epsilon_2}}) \le (1 + \epsilon_2)\, w(\pi_{\bar{G}})$. It remains to prove that $w(\pi_{\bar{G}}) \le (1 + \epsilon_1)\, w(\pi_G)$, which allows us to conclude that $w(\pi_{\bar{G}_{\epsilon_2}}) \le (1 + \epsilon_1)(1 + \epsilon_2)\, w(\pi_G)$. We do so by showing that there exists a path $\pi'$ in $\bar{G}$ such that its weight is no more than $(1 + \epsilon_1)$ times the weight of the shortest path $\pi_G$ of $G$. The thesis follows immediately because $\pi_{\bar{G}}$ is a shortest path, so by definition $w(\pi_{\bar{G}}) \le w(\pi')$.

The path $\pi'$ is obtained by transforming $\pi_G$ so that its edges are all contained in $\bar{G}$. Note that this transformation is not actually performed by the algorithm; it only serves for proving the lemma. The transformation is done as follows. For each edge $(v_i, v_j)$ in $\pi_G$, either $w(v_i, v_j) \le L$, so there is nothing to do, or $w(v_i, v_j) > L$, in which case we need to substitute it with a subpath from $v_i$ to $v_j$ whose edges are all in $\bar{G}$.
This subpath can be found greedily, starting from $v_i$ and always traversing the longest edge until we reach $v_j$. It can be proved that the weighting function $w$ satisfies $w(v_i, v_k) + w(v_k, v_j) \le w(v_i, v_j) + F + 1$ for any $0 \le i < k < j \le n$; intuitively, this means that by breaking an edge into two shorter edges we lose at most $F + 1$ bits. By combining this property with the fact that all the edges in the subpath except possibly the last have cost at least $L$, it follows that the number of edges in this subpath is $O(w(v_i, v_j)/L)$ and, since each split adds at most $F + 1$ bits, that by the choice of $L$ its cost is at most $w(v_i, v_j) + \epsilon_1 w(v_i, v_j)$, thus proving the thesis.

It remains to describe how to generate the pruned graph $\bar{G}_{\epsilon_2}$ in $O(n \log_{1+\epsilon} \frac{1}{\epsilon})$ time directly, without explicitly constructing $G$ which, otherwise, would require quadratic time. This is done by keeping $k = O(\log_{1+\epsilon} \frac{1}{\epsilon})$ windows $W_0, \ldots, W_k$ sliding over the sequence $S$, one for each possible exponent $h$ such that $F(1 + \epsilon)^h \le L$. These sliding windows cover potentially different portions of $S$ that start at the same position $i$ but have different ending positions. Armed with these sliding windows, we generate the $(1 + \epsilon)$-maximal edges outgoing from any vertex $v_i$ on-the-fly as soon as the shortest path algorithm visits this vertex.

Initially, each sliding window $W_j$ starts and ends at position 0. During the execution, every time the shortest path algorithm visits the next vertex $v_i$, we advance the starting position of each $W_j$ by one position, and its ending position until the cost of representing the currently covered portion of $S$ is larger than $F(1 + \epsilon)^j$. It is easy to prove that if at the end of these moves the window $W_j$ covers $S[i, r]$, then $(v_i, v_r)$ is the $(1 + \epsilon)$-maximal edge outgoing from the vertex $v_i$ for the weight bound $F(1 + \epsilon)^j$. Notice that with this approach we generate all the maximal edges of $\bar{G}_{\epsilon_2}$ by performing a scan of $S$ for each sliding window. Every time we move the starting or the ending position of a window we need to evaluate the cost of representing its covered portion of $S$. This can be done in constant time with simple arithmetic operations. Thus, it follows that generating the pruned graph requires constant time per edge, hence $O(n \log_{1+\epsilon} \frac{1}{\epsilon})$ time overall.

We conclude the section by describing how to modify the first-level data structure to support arbitrary partitions: together with the first-level sequence $L$ with the last element of each block, we write a second sequence $E$ which contains the positions of the endpoints of the partition. This sequence can be again represented with Elias-Fano. The NextGEQ operation can be supported as before, while for Access($i$) we can find the chunk of $i$ with NextGEQ on $E$. Both $L$ and $E$ have as many elements as the number of the chunks in the partition.

5. EXPERIMENTAL ANALYSIS

We performed our experiments on the following datasets.

ClueWeb09 is the ClueWeb 2009 TREC Category B test collection, consisting of 50 million English web pages crawled between January and February 2009.

Gov2 is the TREC 2004 Terabyte Track test collection, consisting of 25 million .gov sites crawled in early 2004; the documents are truncated to 256 kB.
| | Gov2 | ClueWeb09 |
|---|---|---|
| Documents | 24,622,347 | 50,131,015 |
| Terms | 35,636,425 | 92,094,694 |
| Postings | 5,742,630,292 | 15,857,983,641 |

Table 1: Basic statistics for the test collections

For each document in the collection the body text was extracted using Apache Tika², and the words lowercased and stemmed using the Porter2 stemmer; no stopwords were removed. The docids were assigned according to the lexicographic order of their URLs [20]. Table 1 reports the basic statistics for the two collections.

² http://tika.apache.org/

Indexes tested. We compare three versions of Elias-Fano indexes, namely EF single, EF uniform, and EF ε-optimal, that are respectively the original single-partition representation, a partitioned representation with uniform partitions, and a variable-length partitioned representation optimized with the algorithm of Section 4.2. The implementation of the base Elias-Fano sequences is mostly faithful to the original description and source code [23].

To validate Elias-Fano indexes against the more widespread block-based indexes, we tested three more indexes, namely Interpolative, OptPFD, and Varint-G8IU, which are representative of best compression ratio, best compression ratio/processing speed trade-off, and highest speed in the literature [15, 21]. For the last two we use the code made available by the authors of [15]. Actually, recent experiments [15] show that there exist methods faster than Varint-G8IU by roughly 50% in sequential decoding. However, they require very large blocks (from $2^{11}$ to $2^{16}$), making block decoding prohibitively expensive if the lists are sparsely accessed.

For these three indexes we encoded the lists in blocks of 128 postings; we keep in a separate array the maximum docid of each block to perform fast skipping through linear search (we experimented with falling back to binary search after a small lookahead, but got worse results). Blocks of postings and blocks of frequencies are interleaved, and the frequency blocks are decoded lazily as needed.
The last block of each list is encoded with variable bytes if it is smaller than 128 postings, to avoid any block padding overhead.

All the implementations of posting lists expose the same interface, namely the Access, NextGEQ, and Next operations described in Section 4. The query algorithms are C++ templates specialized for each posting list implementation to avoid the virtual method call overhead, which can be significant for operations that take just a handful of CPU cycles. For the same reason, we made sure that frequent code paths are inlined in the query processing code.

Testing details. All the algorithms were implemented in C++11 and compiled with GCC 4.9 with the highest optimization settings. We do not use any SIMD instructions or special processor features, except for the instructions to count the 1 bits in a word and to find the position of the least significant bit; both are available in modern x86-64 CPUs and exposed by the compiler through portable builtins. Apart from this, all the code is standard C++11. The tests were performed on a machine with 24 Intel Xeon E5-2697 Ivy Bridge cores (48 threads) clocked at 2.70 GHz, with 64 GiB RAM, running Linux 3.12.7.

The data structures were saved to disk after construction, and memory-mapped to perform the queries. Before timing the queries we ensure that the index is fully loaded in memory. The source code is available at http://github.com/ot/partitioned_elias_fano/tree/sigir14 for the reader interested in replicating the experiments.

5.1 Space and construction time

Before comparing the spaces obtained with the different indexes, we have to set the parameters needed by the partitioned EF indexes, namely the chunk size for EF uniform and the approximation parameters $\epsilon_1$, $\epsilon_2$ for EF ε-optimal. For the former, we show in Figure 1 the space obtained by EF uniform on Gov2 for different values of the chunk size.
The minimum space is obtained at 128, which is also a widely used block size for block-based indexes; in all the following experiments we will use this chunk size.

A little more care has to be put in choosing the parameters $\epsilon_1$ and $\epsilon_2$, since they significantly affect the construction time, as shown in Figure 2. First we fix $\epsilon_1$ at 0, thus posing no upper bound on the chunk costs, and let $\epsilon_2$ vary. Notice that the $(1 + \epsilon_2)$ approximation bound is very pessimistic: even setting $\epsilon_2$ as high as 0.5, meaning a potential 50% overhead, the actual overhead is never higher than 1.5%. Hence, we set $\epsilon_2 = 0.3$ to control the construction time.

| | Gov2 space (GB) | Gov2 doc (bpi) | Gov2 freq (bpi) | ClueWeb09 space (GB) | ClueWeb09 doc (bpi) | ClueWeb09 freq (bpi) |
|---|---|---|---|---|---|---|
| EF single | 7.66 (+64.7%) | 7.53 (+83.4%) | 3.14 (+32.4%) | 19.63 (+23.1%) | 7.46 (+27.7%) | 2.44 (+11.0%) |
| EF uniform | 5.17 (+11.2%) | 4.63 (+12.9%) | 2.58 (+8.4%) | 17.78 (+11.5%) | 6.58 (+12.6%) | 2.39 (+8.8%) |
| EF ε-optimal | 4.65 | 4.10 | 2.38 | 15.94 | 5.85 | 2.20 |
| Interpolative | 4.57 (-1.8%) | 4.03 (-1.8%) | 2.33 (-1.8%) | 14.62 (-8.3%) | 5.33 (-8.8%) | 2.04 (-7.1%) |
| OptPFD | 5.22 (+12.3%) | 4.72 (+15.1%) | 2.55 (+7.4%) | 17.80 (+11.6%) | 6.42 (+9.8%) | 2.56 (+16.4%) |
| Varint-G8IU | 14.06 (+202.2%) | 10.60 (+158.2%) | 8.98 (+278.3%) | 39.59 (+148.3%) | 10.99 (+88.1%) | 8.98 (+308.8%) |

Table 2: Overall space in gigabytes, and average bits per docid and frequency

Figure 1: Index size in gigabytes for Gov2 with EF uniform at different chunk sizes (from 32 to 1024)

After fixing $\epsilon_2$, we let $\epsilon_1$ vary. Notice the sharp drop in running time without a noticeable increase in space as soon as $\epsilon_1$ is non-zero: it is a direct consequence of the algorithm going from $O(n \log n)$-time to $O(n)$-time. Again, the spread between the worst and best solutions found is smaller than 1%. In the following, we set $\epsilon_1 = 0.03$.

We found that with these parameters, the average chunk length on Gov2 for docid sequences is 231 and for frequencies 466. On ClueWeb09 they are respectively 142 and 512. For brevity we omit the plots for ClueWeb09, which are very similar to the ones for Gov2.

Figure 2: Influence of the parameters $\epsilon_1$ and $\epsilon_2$ on the EF ε-optimal indexes for Gov2: (a) $\epsilon_1 = 0$, varying $\epsilon_2$ from 0.025 to 0.5; (b) $\epsilon_2 = 0.3$, varying $\epsilon_1$ from 0 to 0.1. Solid line is the overall size in gigabytes (left scale), dashed line is the construction time in minutes (right scale).

Table 2 shows the index space, both as overall size in gigabytes and broken down in bits per integer for docids and frequencies.
Next to each value is shown the relative loss/gain (in percentage) compared to EF ε-optimal. The results confirm that partitioning the indexes indeed pays off: compared to EF ε-optimal, EF single is 64.7% larger on Gov2 and 23.1% larger on ClueWeb09. The optimization strategy also produces significant savings: EF uniform is about 11% larger on both Gov2 and ClueWeb09, but still significantly smaller than EF single.

Compared to the other indexes, Varint-G8IU is by far the largest, 2.5 to 3 times larger than EF ε-optimal; it is particularly inefficient on the frequencies, as it needs at least 8 bits to encode an integer. On the other end of the spectrum, Interpolative confirms its high compression ratio and produces the smallest indexes, but the edge over EF ε-optimal is surprisingly small: only 1.8% on Gov2 and 8.3% on ClueWeb09. To conclude the comparison, we observe that OptPFD loses more than 10% w.r.t. EF ε-optimal: 12.3% on Gov2 and 11.6% on ClueWeb09. Since, as we will show in the following, EF ε-optimal is also faster than OptPFD, this implies that our solution dominates OptPFD on the trade-off curve.

Regarding construction time, for all the indexes except EF ε-optimal, Gov2 can be processed on a single thread in 8 to 10 minutes, while ClueWeb09 in 24 to 30 minutes; in both cases the construction is essentially I/O-bound. For EF ε-optimal, instead, the algorithm that computes the optimal partition, despite being linear-time, has a noticeable CPU cost, raising the time for Gov2 to 95 minutes and for ClueWeb09 to 284 minutes. However, since the encoding of each list can be performed independently, using all the 24 cores of the test machine makes EF ε-optimal construction I/O-bound as well. Parallelization does not improve the construction time of the other encoders.

5.2 Query processing

To evaluate the speed of query processing we randomly sampled two sets of 1000 queries respectively from TREC 2005 and 2006 Efficiency Track topics, drawing only queries whose terms are all in the collection dictionary. The two datasets have quite different statistics, as will be apparent in the results.
In his experimental analysis, Vigna [23] provides some examples showing that Elias-Fano indexes are particularly efficient in conjunctive queries that have sparse results. To make the analysis more systematic, we define as selective any query such that the fraction of the documents that contain all its terms over those that contain at least one of them is small (we set this threshold to 0.5%). For both datasets, selective queries among TREC 2005 queries are at least 58%, and among TREC 2006 queries at least 78%, hence making up most of the samples.

The query times were measured by running each query set 3 times, and averaging the results. All the times are reported in milliseconds. In the timing tables, next to each timing is reported in parentheses the relative percentage against EF ε-optimal. Not very surprisingly, Interpolative is always 50% to 500% slower than the others, and Varint-G8IU is 10% to 40% faster, so for the sake of brevity we will focus the following analysis on the Elias-Fano indexes and OptPFD.

Boolean queries. We first analyze the basic disjunctive (OR) and conjunctive (AND) queries. Note that these queries do not need the term frequencies, so only the docid lists are accessed. For these types of operations, we measure the time needed to count the number of results matching the query.

| | Gov2 TREC 05 | Gov2 TREC 06 | ClueWeb09 TREC 05 | ClueWeb09 TREC 06 |
|---|---|---|---|---|
| EF single | 80.7 (+8%) | 175.0 (+10%) | 261.0 (+0%) | 444.0 (-2%) |
| EF uniform | 72.1 (-3%) | 154.0 (-3%) | 254.0 (-3%) | 435.0 (-4%) |
| EF ε-optimal | 74.5 | 159.0 | 261.0 | 451.0 |
| Interpolative | 121.0 (+62%) | 257.0 (+62%) | 399.0 (+53%) | 680.0 (+51%) |
| OptPFD | 69.5 (-7%) | 148.0 (-7%) | 235.0 (-10%) | 398.0 (-12%) |
| Varint-G8IU | 67.4 (-10%) | 143.0 (-10%) | 222.0 (-15%) | 375.0 (-17%) |

Table 3: Times for OR queries

Times for OR queries are reported in Table 3. Unsurprisingly, as OR needs to scan the whole lists, block-based indexes perform better than Elias-Fano indexes, since they are optimized for raw decoding speed. However, the edge is not as high as one could expect, ranging from 7% to 17%. In sequential decoding Varint-G8IU can be even twice as fast as OptPFD [15]; in this task, however, it is not even 10% faster. The reason can most likely be traced back to branch misprediction penalties: the cost of decoding an integer is in the order of 2-5 CPU cycles; at each decoded docid, the CPU has to decide whether it is equal to the current candidate docid, or if it must be considered as a candidate for the next docid.
The resulting jump is basically unpredictable, and the branch misprediction penalty on modern CPUs can be as high as 10-20 cycles (specifically at least 15 for the CPU we used [13]), thus becoming the main bottleneck of query processing. Among Elias-Fano indexes, EF ε-optimal and EF uniform are either negligibly slower or slightly faster than EF single; we believe that the additional complexity is balanced by the higher memory throughput caused by the smaller sizes.

Table 4 reports the times for AND queries. Again, EF ε-optimal is competitive with EF single, except for selective queries where the overhead is slightly higher, touching 14%. EF uniform is slightly slower, which is likely caused by the smaller chunk sizes compared to EF ε-optimal. This will also be the case in the other queries. Compared to OptPFD, EF ε-optimal is 14% to 26% faster in all cases on general queries. The gap becomes even higher for selective queries, ranging from 34% to 40%, confirming the observations made in [23].

(a) All queries

| | Gov2 TREC 05 | Gov2 TREC 06 | ClueWeb09 TREC 05 | ClueWeb09 TREC 06 |
|---|---|---|---|---|
| EF single | 2.1 (+10%) | 4.7 (+1%) | 13.6 (-5%) | 15.8 (-9%) |
| EF uniform | 2.1 (+9%) | 5.1 (+10%) | 15.5 (+8%) | 18.9 (+9%) |
| EF ε-optimal | 1.9 | 4.6 | 14.3 | 17.4 |
| Interpolative | 7.5 (+291%) | 20.4 (+343%) | 55.7 (+289%) | 76.5 (+341%) |
| OptPFD | 2.2 (+14%) | 5.7 (+24%) | 16.6 (+16%) | 21.9 (+26%) |
| Varint-G8IU | 1.5 (-20%) | 4.0 (-13%) | 11.1 (-23%) | 14.8 (-15%) |

(b) Selective queries

| | Gov2 TREC 05 | Gov2 TREC 06 | ClueWeb09 TREC 05 | ClueWeb09 TREC 06 |
|---|---|---|---|---|
| EF single | 1.1 (-11%) | 2.5 (-9%) | 9.2 (-14%) | 11.1 (-13%) |
| EF uniform | 1.3 (+8%) | 3.0 (+11%) | 11.3 (+6%) | 13.0 (+2%) |
| EF ε-optimal | 1.2 | 2.7 | 10.7 | 12.7 |
| Interpolative | 6.0 (+399%) | 14.3 (+430%) | 49.9 (+368%) | 61.0 (+379%) |
| OptPFD | 1.6 (+34%) | 3.8 (+40%) | 13.0 (+22%) | 17.0 (+33%) |
| Varint-G8IU | 1.1 (-11%) | 2.5 (-6%) | 8.8 (-18%) | 11.3 (-12%) |

Table 4: Times for AND queries

Ranked queries. In order to analyze the impact of accessing the frequencies at query time, we measured the time required to find the top-10 results for AND and WAND [3] queries using BM25 [18] scoring. Results for scored AND, reported in Table 5, and for WAND, reported in Table 6, are actually very similar.
As before, we note that the overhead of EF ε-optimal against EF single is small, ranging between 2% and 13% for all queries, and between 8% and 18% for selective queries. Also as before, EF uniform is about 10% slower than EF ε-optimal; this is likely caused by the optimal partitioning algorithm placing chunk endpoints around dense clusters of docids, hence making the query algorithms cross fewer chunk boundaries.

Compared to OptPFD, EF ε-optimal is slightly slower on Gov2 and slightly faster on the larger ClueWeb09 on all queries. On selective queries, however, it is never slower, and up to 23% faster, thus providing further evidence that Elias-Fano indexes benefit significantly from selectiveness, even for non-conjunctive queries.

(a) All queries

| | Gov2 TREC 05 | Gov2 TREC 06 | ClueWeb09 TREC 05 | ClueWeb09 TREC 06 |
|---|---|---|---|---|
| EF single | 4.0 (-2%) | 8.4 (-7%) | 22.8 (-9%) | 24.6 (-13%) |
| EF uniform | 4.4 (+7%) | 9.8 (+8%) | 27.3 (+9%) | 31.0 (+9%) |
| EF ε-optimal | 4.1 | 9.0 | 25.1 | 28.4 |
| Interpolative | 14.1 (+242%) | 38.6 (+327%) | 99.1 (+295%) | 132.0 (+365%) |
| OptPFD | 3.9 (-7%) | 9.2 (+1%) | 25.8 (+3%) | 31.6 (+11%) |
| Varint-G8IU | 2.6 (-38%) | 5.5 (-39%) | 15.8 (-37%) | 18.0 (-37%) |

(b) Selective queries

| | Gov2 TREC 05 | Gov2 TREC 06 | ClueWeb09 TREC 05 | ClueWeb09 TREC 06 |
|---|---|---|---|---|
| EF single | 1.7 (-16%) | 4.1 (-14%) | 12.5 (-18%) | 16.2 (-17%) |
| EF uniform | 2.1 (+7%) | 5.2 (+9%) | 16.3 (+7%) | 21.1 (+8%) |
| EF ε-optimal | 2.0 | 4.8 | 15.2 | 19.4 |
| Interpolative | 10.2 (+412%) | 25.9 (+439%) | 80.7 (+430%) | 99.7 (+412%) |
| OptPFD | 2.3 (+18%) | 5.7 (+18%) | 18.8 (+23%) | 23.1 (+19%) |
| Varint-G8IU | 1.4 (-32%) | 3.3 (-32%) | 10.6 (-30%) | 13.6 (-30%) |

Table 5: Times for AND top-10 BM25 queries

(a) All queries

| | Gov2 TREC 05 | Gov2 TREC 06 | ClueWeb09 TREC 05 | ClueWeb09 TREC 06 |
|---|---|---|---|---|
| EF single | 8.8 (-4%) | 15.8 (-7%) | 31.5 (-7%) | 41.2 (-13%) |
| EF uniform | 9.7 (+6%) | 18.2 (+7%) | 36.9 (+9%) | 51.2 (+8%) |
| EF ε-optimal | 9.2 | 17.1 | 34.0 | 47.4 |
| Interpolative | 28.0 (+203%) | 62.6 (+267%) | 123.0 (+262%) | 200.0 (+322%) |
| OptPFD | 8.7 (-6%) | 16.7 (-2%) | 35.6 (+5%) | 52.1 (+10%) |
| Varint-G8IU | 6.1 (-34%) | 11.1 (-35%) | 24.0 (-30%) | 34.3 (-28%) |

(b) Selective queries

| | Gov2 TREC 05 | Gov2 TREC 06 | ClueWeb09 TREC 05 | ClueWeb09 TREC 06 |
|---|---|---|---|---|
| EF single | 8.5 (-8%) | 12.2 (-9%) | 21.0 (-19%) | 34.2 (-15%) |
| EF uniform | 9.8 (+6%) | 14.4 (+7%) | 27.3 (+6%) | 43.0 (+7%) |
| EF ε-optimal | 9.2 | 13.5 | 25.8 | 40.1 |
| Interpolative | 33.8 (+265%) | 54.7 (+307%) | 115.0 (+345%) | 179.0 (+346%) |
| OptPFD | 9.3 (+0%) | 14.1 (+4%) | 29.9 (+16%) | 45.7 (+14%) |
| Varint-G8IU | 6.3 (-32%) | 9.2 (-32%) | 19.2 (-26%) | 30.0 (-25%) |

Table 6: Times for WAND top-10 BM25 queries

6. CONCLUSION AND FUTURE WORK

We introduced two new index representations, EF uniform and EF ε-optimal, which significantly improve the compression ratio of Elias-Fano indexes at a small query performance cost. EF ε-optimal is always convenient except when construction time is an issue, in which case EF uniform offers a 12% worse compression ratio without this additional construction overhead. Furthermore, EF ε-optimal offers a better space-time trade-off than the state-of-the-art OptPFD.

Future work will focus on making partitioned Elias-Fano indexes even faster; in particular it may be worth exploring the different space-time trade-offs that can be obtained by varying the average chunk size. It would also be interesting to devise faster algorithms to compute optimal partitions preserving the same approximation guarantees.

Acknowledgements

This work was supported by Midas EU Project (318786), by MIUR of Italy project PRIN ARS Technomedia 2012 and by ecloud EU Project (325091). We would like to thank Sebastiano Vigna for sharing with us his C++ implementation of Elias-Fano indexes, which we used to validate our implementation.

7. REFERENCES

[1] V. N. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Inf. Retr., 8(1), 2005.
[2] V. N. Anh and A. Moffat. Index compression using 64-bit words. Softw., Pract. Exper., 40(2):131–147, 2010.
[3] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Y. Zien. Efficient query evaluation using a two-level retrieval process. In CIKM, pages 426–434, 2003.
[4] A. Buchsbaum, G. Fowler, and R. Giancarlo. Improving table compression with combinatorial optimization. Journal of the ACM, 50(6):825–851, 2003.
[5] S. Büttcher and C. L. A. Clarke. Index compression is good, especially for random access. In CIKM, 2007.
[6] S. Büttcher, C. L. A. Clarke, and G. V. Cormack. Information retrieval: implementing and evaluating search engines. MIT Press, Cambridge, Mass., 2010.
[7] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 2009.
[8] M. Curtiss et al. Unicorn: A system for searching the social graph. VLDB, 6(11):1150–1161, Aug. 2013.
[9] R. Delbru, S. Campinas, and G. Tummarello. Searching web data: An entity retrieval and high-performance indexing model. J. Web Sem., 10:33–58, 2012.
[10] P. Elias. Efficient storage and retrieval by content and address of static files. J. ACM, 21(2):246–260, 1974.
[11] R. M. Fano. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT, Cambridge, MA, 1971.
[12] P. Ferragina, I. Nitto, and R. Venturini. On optimally partitioning a text to improve its compression. Algorithmica, 61(1):51–74, 2011.
[13] A. Fog. The microarchitecture of Intel, AMD and VIA CPUs. http://www.agner.org/optimize/microarchitecture.pdf.
[14] J. Goldstein, R. Ramakrishnan, and U. Shaft. Compressing relations and indexes. In ICDE, 1998.
[15] D. Lemire and L. Boytsov. Decoding billions of integers per second through vectorization. Software: Practice & Experience, 2013.
[16] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[17] A. Moffat and L. Stuiver. Binary interpolative coding for effective index compression. Inf. Retr., 3(1), 2000.
[18] S. E. Robertson and K. S. Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129–146, 1976.
[19] D. Salomon. Variable-length Codes for Data Compression. Springer, 2007.
[20] F. Silvestri. Sorting out the document identifier assignment problem. In ECIR, pages 101–112, 2007.
[21] F. Silvestri and R. Venturini. VSEncoding: Efficient coding and fast decoding of integer lists via dynamic programming. In CIKM, pages 1219–1228, 2010.
[22] A. A. Stepanov, A. R. Gangolli, D. E. Rose, R. J. Ernst, and P. S. Oberoi. SIMD-based decoding of posting lists. In CIKM, pages 317–326, 2011.
[23] S. Vigna. Quasi-succinct indices. In WSDM, 2013.
[24] I. H. Witten, A. Moffat, and T. C. Bell. Managing gigabytes (2nd ed.): compressing and indexing documents and images. Morgan Kaufmann Publishers Inc., 1999.
[25] H. Yan, S. Ding, and T. Suel. Compressing term positions in web indexes. In SIGIR, pages 147–154, 2009.
[26] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In WWW, pages 401–410, 2009.
[27] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), 2006.
[28] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In ICDE, 2006.