Succinct Data Structures: From Theory to Practice




Succinct Data Structures: From Theory to Practice. Simon Gog, Computing and Information Systems, The University of Melbourne, January 5th, 2014

Big data: Volume, Velocity,...


...and Variety. This talk addresses two Vs. Variety: index unstructured data. Volume: index larger data sets. Velocity is not directly addressed: the index is constructed once and queried often. Picture source: IBM

Outline: 1. The Inverted Index: a success story; 2. Suffix-sorting-based indexes: a tour de force; 3. More virtues of the Wavelet Tree; 4. Various structures; 5. SDSL: a toolbox of succinct structures; 6. Conclusion

Index structures are the basis of information retrieval.

Inverted Index. Inverted Indexes for a collection of documents C: used for web indexing; practical in domains with a well-defined vocabulary; space is usually about 5% of C without positions, and about 50% if positions are included; cannot be used to reconstruct the original text (parsing, stemming, stopping, ...); the original text has to be stored separately to allow snippet extraction.

Inverted Index. Documents (already normalized): d0: is big data really big; d1: is it big in science; d2: big data is big. Inverted lists of (document, position) pairs: big: {(0,1),(0,4),(1,2),(2,0),(2,3)}; data: {(0,2),(2,1)}; in: {(1,3)}; is: {(0,0),(1,0),(2,2)}; really: {(0,3)}; science: {(1,4)}
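The lists above can be reproduced in a few lines. A minimal sketch: the whitespace tokenizer, the 0-based positions, and the names `Postings`/`build_index` are illustrative, not part of the talk or of SDSL.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Positional inverted index: term -> list of (document id, position) pairs.
using Postings = std::vector<std::pair<int, int>>;

std::map<std::string, Postings> build_index(const std::vector<std::string>& docs) {
    std::map<std::string, Postings> index;
    for (int d = 0; d < (int)docs.size(); ++d) {
        std::istringstream in(docs[d]);
        std::string term;
        for (int pos = 0; in >> term; ++pos)
            index[term].push_back({d, pos});   // record every occurrence
    }
    return index;
}
```

For the three example documents, the list for big comes out as {(0,1),(0,4),(1,2),(2,0),(2,3)}, matching the slide.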

Exact search using Inverted Index based systems (1): no document containing <line=break /> is returned.

Exact search using Inverted Index based systems (2): but it exists...

Summary: Inverted Indexes. Advantages: sequential locality in many operations (fast!); the index can be stored on disk; good compression for natural-language text. Disadvantages: the decision of what is searchable is made at parsing time; no fast calculation of phrase queries; not applicable where there is no well-defined alphabet; the data is adapted to the index structure. Search quality seems to be enough for casual web search; scientific tasks require high-quality search.

Outline: 1. The Inverted Index: a success story; 2. Suffix-sorting-based indexes: a tour de force; 3. More virtues of the Wavelet Tree; 4. Various structures; 5. SDSL: a toolbox of succinct structures; 6. Conclusion

Suffix-sorting-based indexes. Suffix-sorting-based indexes for a text T: sort all n suffixes of T; work also in domains with no well-defined vocabulary; used in bioinformatics; space uncompressed: 5 to 15 times the size of T; compressed: about n·H_k(T) bits, in the range of bzip2-compressed T; the index structure adapts to the data; can be used to reconstruct the original text (Self-Index). Not covered in this talk: LZ-compressed indexes (they work well for highly repetitive data).

Table: empirical entropy H_k(T) in bits per symbol, and number of contexts as a percentage of |T|, for k = 0 to 7 on the Pizza&Chili corpus texts DBLP.XML, DNA, ENGLISH and PROTEINS (200MB versions). The exact values were lost in transcription; H_k decreases as k grows, while the number of contexts grows.

Figure: index size comparison for English text, relative to the original text size: Suffix Tree, Suffix Array, Self-Index.

Self-Index Operations. Let P be a character pattern of length m = |P|. Operations: count(P): how many times does P occur in T? locate(P): list all count(P) positions where P occurs. extract(i,j): extract the text T[i..j] from the index. The operations are efficient (e.g. O(m log σ) or O(m log n)).

Figure: Suffix Tree (ST) of T = umulmundumulmum$, with n = 16, Σ = {$, d, l, m, n, u}, σ = 6. O(n) construction (Weiner, 1973). Problem: the size of the structure, a large constant times |T|.

Suffix Array (SA) of T = umulmundumulmum$: write down all 16 suffixes T[i..] of T, for i = 0 to 15, from umulmundumulmum$ down to $.

Suffix Array (SA) of T = umulmundumulmum$: SA = [15, 7, 11, 3, 14, 9, 1, 12, 4, 6, 10, 2, 13, 8, 0, 5] lists the starting positions of the suffixes in lexicographic order: $, dumulmum$, lmum$, lmundumulmum$, m$, mulmum$, mulmundumulmum$, mum$, mundumulmum$, ndumulmum$, ulmum$, ulmundumulmum$, um$, umulmum$, umulmundumulmum$, undumulmum$. Size: 5 |T| (4-byte SA entries plus the text). Binary search answers count(P); the pattern is matched from left to right, e.g. mum.

Burrows-Wheeler Transform (BWT) of T: T^BWT[i] = T[SA[i] - 1] (and $ where SA[i] = 0), i.e. the character preceding each suffix in lexicographic order. For T = umulmundumulmum$: T^BWT = mnuuuuulluummmd$m. Well compressible. New search algorithm (Ferragina & Manzini, 2000): backward search, which does not require SA.

Backward Search. Example: T = abracadabrabarbara$ with n = 19. The rows i = 0..18 pair T^BWT[i] with the i-th suffix in lexicographic order; reading the BWT column top to bottom gives T^BWT = arrd$rcbbraaaaaabba. The array C contains the start of each character interval: C[$] = 0, C[a] = 1, C[b] = 9, C[c] = 13, C[d] = 14, C[r] = 15 (and C[r+1] = 19). rank(i, c) = number of occurrences of c in T^BWT[0, i). Search backwards for bar.

Step 1, character r. Initial interval [sp0, ep0] = [0..n-1] = [0..18]. sp1 = C[r] + rank(sp0, r) = 15 + 0 = 15; ep1 = C[r] + rank(ep0 + 1, r) - 1 = 15 + 4 - 1 = 18.

Step 2, character a. Interval [sp1, ep1] = [15..18]. sp2 = C[a] + rank(sp1, a) = 1 + 6 = 7; ep2 = C[a] + rank(ep1 + 1, a) - 1 = 1 + 8 - 1 = 8.

Step 3, character b. Interval [sp2, ep2] = [7..8]. sp3 = C[b] + rank(sp2, b) = 9 + 0 = 9; ep3 = C[b] + rank(ep2 + 1, b) - 1 = 9 + 2 - 1 = 10.

The final interval [9..10] contains the suffixes bara$ and barbara$: bar occurs twice in T.
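The steps above follow one uniform rule, which is all an implementation needs. A sketch with a naively scanned rank; the names `occ_rank` and `backward_search` are illustrative, and a real FM-index answers rank with a wavelet tree instead of a scan.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Number of occurrences of c in bwt[0, i) -- naive scan, standing in for a
// wavelet-tree rank.
int occ_rank(const std::string& bwt, int i, char c) {
    int r = 0;
    for (int k = 0; k < i; ++k) r += (bwt[k] == c);
    return r;
}

// Backward search (Ferragina & Manzini): shrink the suffix-array interval of
// the pattern, processing it from the last character to the first.
std::pair<int, int> backward_search(const std::string& bwt,
                                    const std::map<char, int>& C,
                                    const std::string& p) {
    int sp = 0, ep = (int)bwt.size() - 1;
    for (int k = (int)p.size() - 1; k >= 0 && sp <= ep; --k) {
        char c = p[k];
        sp = C.at(c) + occ_rank(bwt, sp, c);
        ep = C.at(c) + occ_rank(bwt, ep + 1, c) - 1;
    }
    return {sp, ep};               // empty interval (sp > ep) means no match
}
```

With T^BWT = arrd$rcbbraaaaaabba and the C array above, searching bar returns the interval [9, 10]: two occurrences.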

Conclusion: Backward Search. String matching on T only requires a structure that can answer rank queries on T^BWT, plus the C array. The text T itself is not needed. We therefore look for a data structure that represents T^BWT in compressed form and answers rank quickly.

The Wavelet Tree (WT). Let X be a sequence of length n over an alphabet Σ of size σ. Efficient wavelet tree operations: access(i): recover X[i]. rank(i,c): how many times does c occur in X[0, i)? select(i,c): location of the i-th occurrence of c in X. Space: n log σ + o(n log σ) + O(σ log n) bits.
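A minimal pointer-based wavelet tree shows how rank cascades through the levels. This layout is illustrative, not the SDSL wt_* classes: a succinct version stores rank-indexed bitvectors per level and drops the pointers, and the per-node bit counting here is naive.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Balanced wavelet tree over a character range [lo, hi]; rank only.
struct WaveletTree {
    char lo, hi;                       // symbol range handled by this node
    std::vector<bool> bits;            // 0: symbol goes left, 1: right
    WaveletTree *left = nullptr, *right = nullptr;

    WaveletTree(const std::string& s, char l, char h) : lo(l), hi(h) {
        if (lo == hi || s.empty()) return;      // leaf (or empty) node
        char mid = lo + (hi - lo) / 2;
        std::string ls, rs;
        for (char c : s) {
            bool b = c > mid;
            bits.push_back(b);
            (b ? rs : ls) += c;                 // partition the sequence
        }
        left  = new WaveletTree(ls, lo, mid);
        right = new WaveletTree(rs, (char)(mid + 1), hi);
    }
    ~WaveletTree() { delete left; delete right; }

    // Occurrences of c in the first i symbols (c must occur in the sequence).
    int rank(int i, char c) const {
        if (lo == hi) return i;                 // leaf: all symbols equal c
        int ones = 0;
        for (int k = 0; k < i; ++k) ones += bits[k];
        char mid = lo + (hi - lo) / 2;
        return c <= mid ? left->rank(i - ones, c)   // remap i into the child
                        : right->rank(ones, c);
    }
};
```

For X = arrd$rcbbraaaaaabba (the BWT of the backward-search example), rank(15, 'a') = 6 and rank(19, 'a') = 8, exactly the values the search needs.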

WT operation rank. The sequence (here arrd$rcbbraaaaaabba) is partitioned recursively by the bits of each symbol's code; only the bitvectors (dark gray in the slide) are stored. A WT rank query is reduced to one bitvector rank per level: to compute rank(i, a, WT), follow the code of a from the root towards its leaf, remapping the position i with a bitvector rank at every level.


Rank on bitvectors: n + o(n) bits of space and constant query time (Jacobson: Space-efficient Static Trees and Graphs, FOCS 1989). Two-level scheme: divide the bitvector into superblocks of size log² n and blocks of size log n / 2; store absolute prefix sums for superblocks and relative prefix sums for blocks; answer queries inside a block with the Four-Russians trick (a lookup table counting set bits). Practice today: since 2008 CPUs support a popcount instruction, so no lookup tables are needed.

Improving WT space: Huffman shape. Assign each character a Huffman codeword (for arrd$rcbbraaaaaabba the frequent a gets a short code, rare characters like c and d get long ones) and shape the tree accordingly. Average depth: H_0(T^BWT). Total space: n·H_0 + 2σ log n bits.

Other WT parametrizations. Shape: balanced, Huffman shape, Hu-Tucker shape; the Wavelet Matrix for large alphabets (decreases the O(σ log n) term to O(log σ log n)); run-length-compressed versions; parameterized bitvectors.

Experimental results: operation count (Gog & Petri, 2014). Figure: count time per character (µs) versus index size (% of the input) for several FM-index parametrizations (Huffman-shaped and run-length variants) and the general-purpose compressors GZIP and XZ, on the instances WEB-64MB and WEB-64GB.

Experimental results: construction. Resource diagram for the construction of a Self-Index for 200MB of English text.

Outline: 1. The Inverted Index: a success story; 2. Suffix-sorting-based indexes: a tour de force; 3. More virtues of the Wavelet Tree; 4. Various structures; 5. SDSL: a toolbox of succinct structures; 6. Conclusion

More WT operations: query point grids. Interpret the sequence as an n × σ grid of points. Operations: count(x0, x1, y0, y1): number of points in the rectangle [x0, x1] × [y0, y1]. report(x0, x1, y0, y1): report the points in the rectangle [x0, x1] × [y0, y1].

WT application: top-k document retrieval. Setup: a concatenation T of documents built from terms ω_i and separators #, the mapping from CSA positions back to text positions, and the document array D, which stores for every suffix array position the document it belongs to.

WT application: top-k document retrieval. Represent the document array D as a wavelet tree. Explore first the nodes that contain the maximal interval. Example: searching the top-2 documents for a term ω under frequency-based ranking descends the tree greedily by interval size and reports the most frequent document (3 occurrences) followed by document 2 (2 occurrences).

WT application: top-k document retrieval. Other results: a 2n + o(n)-bit structure answers document frequency in constant time (Sadakane 2007); RMQs can be used for document listing (Sadakane 2007); prestore selected intervals (Hon et al. 2009); use a skeleton Suffix Tree for mapping.

Outline: 1. The Inverted Index: a success story; 2. Suffix-sorting-based indexes: a tour de force; 3. More virtues of the Wavelet Tree; 4. Various structures; 5. SDSL: a toolbox of succinct structures; 6. Conclusion

Compressed Bitvectors: the SD-array (Elias-Fano coding). Applications: graphs (de Bruijn subgraphs, web graphs, ...), lists (inverted lists, ...). Input: a monotone sequence X with x_0 ≤ x_1 ≤ ... ≤ x_{m-1} < n. The lower l = ⌊log(n/m)⌋ bits of each element are stored explicitly; the upper bits are stored as a sequence H of unary-encoded gaps. Uses 2 + ⌈log(n/m)⌉ bits per element. Constant-time access to X by adding o(m) extra bits to H.

Compressed Bitvectors: the SD-array (Elias-Fano coding). Each element of X is split into an upper part, encoded as a unary gap in H, and a lower part, stored verbatim in L. X[i] is recovered as (select₁(H, i + 1) − i − 1) · 2^l + L[i]: the number of 0s in H before the (i + 1)-th 1 gives the upper part.

Compressed Bitvectors: the SD-array (Elias-Fano coding). Interpret X as the positions of the set bits in a bitvector b. This gives constant-time access, rank, and select on b, and is well suited for sparse bitvectors. select(b, i) is answered directly from select₁ on H and the corresponding entry of L; rank(b, i) and access(b, i) use select on H to find the right upper part and then scan left for the entry in L.

Balanced Parentheses Sequences (BPS). Represent a BPS as a bitvector, e.g. (()()(()())(()((()())()()))()((()())(()(()()))())). find_close(i): find the matching closing paren of BPS[i]. find_open(i): find the matching opening paren of BPS[i]. enclose(i): find the tightest paren pair that encloses the pair (i, find_close(i)). These and more (double_enclose(i,j), rank(i), select(i), ...) can be answered in constant time by adding just o(n) bits (Munro 1996; Geary et al. 2004 and 2006).
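The operations are defined by excess (nesting depth) counting. A naive scan spells out the contract that the o(n)-bit structures answer in O(1); the function names follow the slide, the scan itself is only illustrative.

```cpp
#include <cassert>
#include <string>

// Matching closing parenthesis of the '(' at position i: the first position
// where the running excess (+1 for '(', -1 for ')') returns to zero.
int find_close(const std::string& bps, int i) {
    int excess = 0;
    for (int j = i; j < (int)bps.size(); ++j) {
        excess += (bps[j] == '(') ? 1 : -1;
        if (excess == 0) return j;
    }
    return -1;   // unbalanced input
}

// Matching opening parenthesis of the ')' at position i: same idea backwards.
int find_open(const std::string& bps, int i) {
    int excess = 0;
    for (int j = i; j >= 0; --j) {
        excess += (bps[j] == ')') ? 1 : -1;
        if (excess == 0) return j;
    }
    return -1;   // unbalanced input
}
```

On the sequence (()()) the outermost pair is (0, 5) and the first inner pair is (1, 2).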

Balanced Parentheses Sequences (BPS). Application: succinct representation of trees. The topology of the suffix tree of T = umulmundumulmum$ is traversed depth-first, writing ( when a node is entered and ) when it is left: BPS_dfs = (()()(()())(()((()())()()))()((()())(()(()()))())). An uncompressed pointer-based tree takes O(n log n) bits; the BPS representation takes about 4n bits.

Balanced Parentheses Sequences (BPS). Application: Range Minimum Queries. Given an array X of n elements from a well-ordered set, rmq(i,j) returns the position of the minimum element in X[i, j]. There is a 2n + o(n)-bit, O(1)-time solution (Fischer & Heun 2011).
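The query contract, written as a plain linear scan; the Fischer-Heun structure returns the same answer in O(1) from 2n + o(n) bits. The sample array in the test is illustrative (the slide's own numbers were garbled in transcription).

```cpp
#include <cassert>
#include <vector>

// rmq(X, i, j): position of the minimum of X[i..j], leftmost on ties.
int rmq(const std::vector<int>& X, int i, int j) {
    int arg = i;
    for (int k = i + 1; k <= j; ++k)
        if (X[k] < X[arg]) arg = k;   // strict '<' keeps the leftmost minimum
    return arg;
}
```

Returning a position rather than a value is what makes RMQ composable with suffix arrays and document arrays in the applications above.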

Compressed Suffix Trees (CST). Full functionality: root(), is_leaf(v), parent(v), degree(v), child(v, c), select_child(v, i), sibling(v), depth(v), lca(v, w), sl(v), wl(v, c), id(v), inv_id(i). Size: CSA + LCP + BPS, i.e. about the size of the original text.

Compressed Suffix Trees (CST): construction. Resource diagram for the construction of a CST for 200MB of English text.

Outline: 1. The Inverted Index: a success story; 2. Suffix-sorting-based indexes: a tour de force; 3. More virtues of the Wavelet Tree; 4. Various structures; 5. SDSL: a toolbox of succinct structures; 6. Conclusion

The Succinct Data Structure Library (SDSL), https://github.com/simongog/sdsl-lite. Input from about 40 research papers. Fast and space-efficient construction. Index structures available for byte and integer alphabets. Well optimized (e.g. hardware POPCOUNT, hugepages, ...). Easy configuration of myriads of CSAs, CSTs and WTs. Fast prototyping. Tests, benchmarks, a tutorial and a cheat sheet.

Outline: 1. The Inverted Index: a success story; 2. Suffix-sorting-based indexes: a tour de force; 3. More virtues of the Wavelet Tree; 4. Various structures; 5. SDSL: a toolbox of succinct structures; 6. Conclusion

Conclusion. Limitations and challenges of succinct structures: they often only work in a static setting; memory access patterns are often non-local; implementation from scratch is difficult; construction is often the bottleneck; algorithms have to be adapted to work fast with succinct structures.

Conclusion. Advantages of succinct structures: they allow indexing of considerably larger data sets than classical indexes; they adapt to the data, e.g. capture repetitions; they often provide more functionality; more functionality often yields conceptually easier solutions; they are not limited to text indexing.

Conclusion. Many applications depend heavily on indexes. Self-indexes made it possible to process Next-Generation Sequencing data in genomics and the life sciences. Why? Only the speed of self-indexes matches the amazing throughput of sequencing technologies.

Thank you for your attention. Thanks to MASTODONS.