Succinct Data Structures: From Theory to Practice

Size: px
Start display at page:

Download "Succinct Data Structures: From Theory to Practice"

Transcription

1 Succinct Data Structures: From Theory to Practice Simon Gog Computing and Information Systems The University of Melbourne January 5th 24

2 Big data: Volume, Velocity,...

3 ...and Variety Picture source IBM

4 ...and Variety This talk addresses two Vs Variety: Index unstructured data Volume: Index larger data sets Velocity is not directly addressed Index: Construct once, query often Picture source IBM

5 Outline The Inverted Index: a success story 2 Suffix sorted based indexes: a tour de force 3 More virtues of the Wavelet Tree 4 Various Structures 5 SDSL: A toolbox of succinct structures 6 Conclusion

6 Index structures are the bases of information retrieval

7 Inverted Index Inverted Indexes for a collection of documents C used for web indexing practical in domains with well-defined vocabulary space is usually 5 % of C (without positions) space is about 5% if positions are included can not be used to reconstruct original text (parsing, stemming, stopping,...) original text has to be stored separately to allow snippet extraction

8 Inverted Index Documents (already normalized) d : is big data really big d : is it big in science d 2 : big data is big Inverted Lists big : {(,),(,4),(,2),(2,),(2,3)} data : {(,2),(2,)} in : {(,3)} is : {(,),(,),(2,2)} really : {(,3)} science : {(,4)}

9 Exact search using Inverted Index based systems () No document containing <line=break /> is returned.

10 Exact search using Inverted Index based systems (2) but it exists...

11 Summary: Inverted Indexes Advantages: Sequential locality in many operations (fast!) Index can be stored on disk Good compression for natural language text Disadvantages: Decision, what is searchable, made at parsing time No fast calculation of phrase queries No alphabet not applicable Adapt data to index structure Search quality seems to be enough for casual web search. Scientific tasks require high quality search.

12 Outline The Inverted Index: a success story 2 Suffix sorted based indexes: a tour de force 3 More virtues of the Wavelet Tree 4 Various Structures 5 SDSL: A toolbox of succinct structures 6 Conclusion

13 Suffix sorted based indexes Suffix sorted based indexes for a text T sort all n suffixes of T works also in domains with no well-defined vocabulary used in Bioinformatics space uncompressed: 5 to 5 size of T compressed: n H k (T ) bzip d T index structure adapts to data can be used to reconstruct original text (Self-Index) Not cover in this talk: LZ compressed indexes (works well for highly repetitive data)

14 H k of Pizza&Chili corpus texts (2MB versions) H k (T ) in bits contexts/ T in percent k DBLP.XML DNA ENGLISH PROTEINS

15 Index size comparison (English text) size original text Suffix Tree Suffix Array Self-Index

16 Self-Index Operations Let P be a character pattern of length m = P. Operations: count(p) How many times does P occur in T? locate(p) List all count(p) positions where P occurs. extract(i,j) Extract text T [i..j] from the index. Operations are efficient (like O(m log σ), O(m log n))

17 Suffix Tree (ST) of T=umulmu 5 7 dumulmum$ $ 4 lmu $ u m u lmu m n = 6 Σ ={$,d,l,m,n,u} σ = 6 m$ 3 9m$ lmu m$ 6 m$ $ 8m$ ulmu 5 O(n) construction (Weiner,973) Problem: Structure size: 2 T

18 Suffix Array (SA) of T=umulmu i T umulmu mulmu ulmu lmu mu u dumulmum$ umulmum$ mulmum$ ulmum$ lmum$ mum$ um$ m$ $

19 Suffix Array (SA) of T=umulmu i SA T $ dumulmum$ lmum$ lmu m$ mulmum$ mulmu mum$ mu ulmum$ ulmu um$ umulmum$ umulmu u Size: 5 T Binary search to answer count( P ) Match pattern from left to right: e.g. mum

20 Burrows-Wheeler Transform (BWT) of T i SA T BWT T m n u u u u u l l u m m m d $ m $ dumulmum$ lmum$ lmu m$ mulmum$ mulmu mum$ mu ulmum$ ulmu um$ umulmum$ umulmu u Well-compressible New search algorithm (Ferragina & Manzini, 2): Backward search Does not require SA.

21 Backward Search i T BWT T a $ r a$ 2 r abarbara$ 3 d abrabarbara$ 4 $ abracadabrabarbara$ 5 r acadabrabarbara$ 6 c adabrabarbara$ 7 b ara$ 8 b arbara$ 9 r bara$ a barbara$ a brabarbara$ 2 a bracadabrabarbara$ 3 a cadabrabarbara$ 4 a dabrabarbara$ 5 a ra$ 6 b rabarbara$ 7 b racadabrabarbara$ 8 a rbara$ Array C contains start of each character interval: $ a b c d r r rank(i, c) = # occurences of c in T BWT [, i). Search backwards for bar.

22 Backward Search i T BWT T a $abracadabrabarbara r a$ 2 r abarbara$ 3 d abrabarbara$ 4 $ abracadabrabarbara$ 5 r acadabrabarbara$ 6 c adabrabarbara$ 7 b ara$ 8 b arbara$ 9 r bara$ a barbara$ a brabarbara$ 2 a bracadabrabarbara$ 3 a cadabrabarbara$ 4 a dabrabarbara$ 5 a ra$ 6 b rabarbara$ 7 b racadabrabarbara$ 8 a rbara$ C $ a b c d r Now search backwards for bar. Initial interval: [sp, ep ] = [..n ] Determine interval for r: sp = C[r]+rank(sp, r) ep = C[r]+rank(ep +, r)

23 Backward Search i T BWT T a $abracadabrabarbara r a$ 2 r abarbara$ 3 d abrabarbara$ 4 $ abracadabrabarbara$ 5 r acadabrabarbara$ 6 c adabrabarbara$ 7 b ara$ 8 b arbara$ 9 r bara$ a barbara$ a brabarbara$ 2 a bracadabrabarbara$ 3 a cadabrabarbara$ 4 a dabrabarbara$ 5 a ra$ 6 b rabarbara$ 7 b racadabrabarbara$ 8 a rbara$ C $ a b c d r Now search backwards for bar. Initial interval: [sp, ep ] = [..n ] Determine interval for r: sp = 5+rank(, r) ep = 5+rank(9, r)

24 Backward Search i T BWT T a $abracadabrabarbara r a$ 2 r abarbara$ 3 d abrabarbara$ 4 $ abracadabrabarbara$ 5 r acadabrabarbara$ 6 c adabrabarbara$ 7 b ara$ 8 b arbara$ 9 r bara$ a barbara$ a brabarbara$ 2 a bracadabrabarbara$ 3 a cadabrabarbara$ 4 a dabrabarbara$ 5 a ra$ 6 b rabarbara$ 7 b racadabrabarbara$ 8 a rbara$ C $ a b c d r Now search backwards for bar. Initial interval: [sp, ep ] = [..n ] Determine interval for r: sp = 5+ ep = 5+rank(9, r)

25 Backward Search i T BWT T a $abracadabrabarbara r a$ 2 r abarbara$ 3 d abrabarbara$ 4 $ abracadabrabarbara$ 5 r acadabrabarbara$ 6 c adabrabarbara$ 7 b ara$ 8 b arbara$ 9 r bara$ a barbara$ a brabarbara$ 2 a bracadabrabarbara$ 3 a cadabrabarbara$ 4 a dabrabarbara$ 5 a ra$ 6 b rabarbara$ 7 b racadabrabarbara$ 8 a rbara$ C $ a b c d r Now search backwards for bar. Initial interval: [sp, ep ] = [..n ] Determine interval for r: sp = 5+ = 5 ep = 5+4 = 8

26 Backward Search i T BWT T a $ r a$ 2 r abarbara$ 3 d abrabarbara$ 4 $ abracadabrabarbara$ 5 r acadabrabarbara$ 6 c adabrabarbara$ 7 b ara$ 8 b arbara$ 9 r bara$ a barbara$ a brabarbara$ 2 a bracadabrabarbara$ 3 a cadabrabarbara$ 4 a dabrabarbara$ 5 a ra$ 6 b rabarbara$ 7 b racadabrabarbara$ 8 a rbara$ C $ a b c d r Now search backwards for bar. Interval: [sp, ep ] = [5..8] Determine interval for ar: sp 2 = C[a]+rank(sp, a) ep 2 = C[a]+rank(ep +, a)

27 Backward Search i T BWT T a $ r a$ 2 r abarbara$ 3 d abrabarbara$ 4 $ abracadabrabarbara$ 5 r acadabrabarbara$ 6 c adabrabarbara$ 7 b ara$ 8 b arbara$ 9 r bara$ a barbara$ a brabarbara$ 2 a bracadabrabarbara$ 3 a cadabrabarbara$ 4 a dabrabarbara$ 5 a ra$ 6 b rabarbara$ 7 b racadabrabarbara$ 8 a rbara$ C $ a b c d r Now search backwards for bar. Interval: [sp, ep ] = [5..8] Determine interval for ar: sp 2 = +rank(5, a) ep 2 = +rank(ep, a)

28 Backward Search i T BWT T a $ r a$ 2 r abarbara$ 3 d abrabarbara$ 4 $ abracadabrabarbara$ 5 r acadabrabarbara$ 6 c adabrabarbara$ 7 b ara$ 8 b arbara$ 9 r bara$ a barbara$ a brabarbara$ 2 a bracadabrabarbara$ 3 a cadabrabarbara$ 4 a dabrabarbara$ 5 a ra$ 6 b rabarbara$ 7 b racadabrabarbara$ 8 a rbara$ C $ a b c d r Now search backwards for bar. Interval: [sp, ep ] = [5..8] Determine interval for ar: sp 2 = +rank(5, a) ep 2 = +rank(ep, a)

29 Backward Search i T BWT T a $ r a$ 2 r abarbara$ 3 d abrabarbara$ 4 $ abracadabrabarbara$ 5 r acadabrabarbara$ 6 c adabrabarbara$ 7 b ara$ 8 b arbara$ 9 r bara$ a barbara$ a brabarbara$ 2 a bracadabrabarbara$ 3 a cadabrabarbara$ 4 a dabrabarbara$ 5 a ra$ 6 b rabarbara$ 7 b racadabrabarbara$ 8 a rbara$ C $ a b c d r Now search backwards for bar. Interval: [sp, ep ] = [5..8] Determine interval for ar: sp 2 = +6 ep 2 = +rank(9, a)

30 Backward Search i T BWT T a $ r a$ 2 r abarbara$ 3 d abrabarbara$ 4 $ abracadabrabarbara$ 5 r acadabrabarbara$ 6 c adabrabarbara$ 7 b ara$ 8 b arbara$ 9 r bara$ a barbara$ a brabarbara$ 2 a bracadabrabarbara$ 3 a cadabrabarbara$ 4 a dabrabarbara$ 5 a ra$ 6 b rabarbara$ 7 b racadabrabarbara$ 8 a rbara$ C $ a b c d r Now search backwards for bar. Interval: [sp, ep ] = [5..8] Determine interval for ar: sp 2 = +6 = 7 ep 2 = +8 = 8

31 Backward Search i T BWT T a $ r a$ 2 r abarbara$ 3 d abrabarbara$ 4 $ abracadabrabarbara$ 5 r acadabrabarbara$ 6 c adabrabarbara$ 7 b ara$ 8 b arbara$ 9 r bara$ a barbara$ a brabarbara$ 2 a bracadabrabarbara$ 3 a cadabrabarbara$ 4 a dabrabarbara$ 5 a ra$ 6 b rabarbara$ 7 b racadabrabarbara$ 8 a rbara$ C $ a b c d r Now search backwards for bar. Interval: [sp 2, ep 2 ] = [7..8] Determine interval for bar: sp 3 = C[b]+rank(sp 2, b) ep 3 = C[b]+rank(ep 2 +, b)

32 Backward Search i T BWT T a $ r a$ 2 r abarbara$ 3 d abrabarbara$ 4 $ abracadabrabarbara$ 5 r acadabrabarbara$ 6 c adabrabarbara$ 7 b ara$ 8 b arbara$ 9 r bara$ a barbara$ a brabarbara$ 2 a bracadabrabarbara$ 3 a cadabrabarbara$ 4 a dabrabarbara$ 5 a ra$ 6 b rabarbara$ 7 b racadabrabarbara$ 8 a rbara$ C $ a b c d r Now search backwards for bar. Interval: [sp 2, ep 2 ] = [7..8] Determine interval for bar: sp 3 = 9+rank(7, b) ep 3 = 9+rank(ep, b)

33 Backward Search i T BWT T a $ r a$ 2 r abarbara$ 3 d abrabarbara$ 4 $ abracadabrabarbara$ 5 r acadabrabarbara$ 6 c adabrabarbara$ 7 b ara$ 8 b arbara$ 9 r bara$ a barbara$ a brabarbara$ 2 a bracadabrabarbara$ 3 a cadabrabarbara$ 4 a dabrabarbara$ 5 a ra$ 6 b rabarbara$ 7 b racadabrabarbara$ 8 a rbara$ C $ a b c d r Now search backwards for bar. Interval: [sp 2, ep 2 ] = [7..8] Determine interval for bar: sp 3 = 9+rank(7, b) ep 3 = 9+rank(ep, b)

34 Backward Search i T BWT T a $ r a$ 2 r abarbara$ 3 d abrabarbara$ 4 $ abracadabrabarbara$ 5 r acadabrabarbara$ 6 c adabrabarbara$ 7 b ara$ 8 b arbara$ 9 r bara$ a barbara$ a brabarbara$ 2 a bracadabrabarbara$ 3 a cadabrabarbara$ 4 a dabrabarbara$ 5 a ra$ 6 b rabarbara$ 7 b racadabrabarbara$ 8 a rbara$ C $ a b c d r Now search backwards for bar. Interval: [sp 2, ep 2 ] = [7..8] Determine interval for bar: sp 3 = 9+ ep 3 = 9+rank(9, b)

35 Backward Search i T BWT T a $ r a$ 2 r abarbara$ 3 d abrabarbara$ 4 $ abracadabrabarbara$ 5 r acadabrabarbara$ 6 c adabrabarbara$ 7 b ara$ 8 b arbara$ 9 r bara$ a barbara$ a brabarbara$ 2 a bracadabrabarbara$ 3 a cadabrabarbara$ 4 a dabrabarbara$ 5 a ra$ 6 b rabarbara$ 7 b racadabrabarbara$ 8 a rbara$ C $ a b c d r Now search backwards for bar. Interval: [sp 2, ep 2 ] = [7..8] Determine interval for bar: sp 3 = 9+ = 9 ep 3 = 9+2 =

36 Conclusion Backward Search String matching on T only requires a structure which can answer rank queries on T BWT the C array Text T itself is not needed. We search a data structure represents T BWT compressed answers rank quickly

37 The Wavelet Tree (WT) Let X be a sequence of length n over alphabet Σ of size σ. Efficient Wavelet Tree operations: access(i) Recover X[i]. rank(i,c) How many times does c occur in X[, i)? select(i,c) Location of the i-th count(p) positions where P occurs. Space: n log σ + o(n log σ) + σ log n

38 WT operation rank arrd$rcbbraaaaaabba a$bbaaaaaabba a$aaaaaaa $ bbbb dc rrdrcr rrrr Only the bitvectors (dark gray) is stored code(a) = Reduce WT rank on bitvector rank aaaaaaaa c d rank(, a, WT ) = rank(rank(rank(,, b ɛ ) = 5,, b ) = 3,, b ) = 2

39 WT operation rank arrd$rcbbraaaaaabba a$bbaaaaaabba a$aaaaaaa $ bbbb dc rrdrcr rrrr Only the bitvectors (dark gray) is stored code(a) = Reduce WT rank on bitvector rank aaaaaaaa c d rank(, a, WT ) = rank(rank(rank(,, b ɛ ) = 5,, b ) = 3,, b ) = 2

40 WT operation rank arrd$rcbbraaaaaabba a$bbaaaaaabba a$aaaaaaa $ bbbb dc rrdrcr rrrr Only the bitvectors (dark gray) is stored code(a) = Reduce WT rank on bitvector rank aaaaaaaa c d rank(, a, WT ) = rank(rank(rank(,, b ɛ ) = 5,, b ) = 3,, b ) = 2

41 WT operation rank arrd$rcbbraaaaaabba a$bbaaaaaabba a$aaaaaaa $ bbbb dc rrdrcr rrrr Only the bitvectors (dark gray) is stored code(a) = Reduce WT rank on bitvector rank aaaaaaaa c d rank(, a, WT ) = rank(rank(rank(,, b ɛ ) = 5,, b ) = 3,, b ) = 2

42 Rank on bitvectors n + o(n) bit space and constant time solution Jacobson: Space-efficient Static Trees and Graphs, FOCS 989. Two level schema: Divide bitvector in superblocks of size log 2 n and blocks of log n/2. Store absolute prefix sums for superblocks. Relative perfix sums for blocks. Four Russian-Trick (Lookup-Table counting set bits) for blocks. Practice today Since 28 CPUs support popcount operation no need for Lookup-Tables.

43 Improving WT space: Huffman shape arrd$rcbbraaaaaabba $c d$c rrd$rcbbrbb d$cbbbb d bbbb rrrr aaaaaaaa Char c codeword(c) $ a b c d r $ c Avg. depth: H (BWT ). Total space: nh + 2σ log n

44 Other WT parametrizations Shape: balanced, Huffman-shape, Hu-Tucker-shape, Wavelet Matrix for large alphabets (decreases O(σ log n) to O(log σ log n)) run-length compressed versions parameterized bitvector

45 Experimental results: operation count 26 S. GOG AND M.PETRI Time per character in (µs) XZ K =27 instance = WEB-64MB Index GZIP FM-HF-R 3 K FM-RLMN FM-HF-V FM-HF-L XZ K =27 K =63 instance = WEB-64GB Feature GZIP TBL BLT+HP K =63 K = K = Index size in (%) Index size in (%) (G & Petri, 24) Figure. Count time and space of our index implementations on input instances of different size with

46 Experimental results: construction Resource diagram for the construction of a Self-Index for 2MB of English text.

47 Outline The Inverted Index: a success story 2 Suffix sorted based indexes: a tour de force 3 More virtues of the Wavelet Tree 4 Various Structures 5 SDSL: A toolbox of succinct structures 6 Conclusion

48 More WT operations: Query point grids n σ Operations: count(x,x,y,y ) # points in rectangle [x, x ][y, y ]. report(x,x,y,y ) Report points in rectangle [x, x ][y, y ].

49 WT application: Top-k document retrieval T = b = ω 2 ω ω 3 ω 3 # ω ω ω 4 ω # ω ω 4 ω 3 ω # ω 5 ω 5 # ω CSA position= D =

50 WT application: Top-k document retrieval Represent document array D as wavelet tree First explore nodes, which contain the maximal interval Example: Search top-2 documents (frequency based ranking) Top documents containing ω :

51 WT application: Top-k document retrieval Represent document array D as wavelet tree First explore nodes, which contain the maximal interval Example: Search top-2 documents (frequency based ranking) Top documents containing ω :

52 WT application: Top-k document retrieval Represent document array D as wavelet tree First explore nodes, which contain the maximal interval Example: Search top-2 documents (frequency based ranking) Top documents containing ω :

53 WT application: Top-k document retrieval Represent document array D as wavelet tree First explore nodes, which contain the maximal interval Example: Search top-2 documents (frequency based ranking) Top documents containing ω : (3 times)

54 WT application: Top-k document retrieval Represent document array D as wavelet tree First explore nodes, which contain the maximal interval Example: Search top-2 documents (frequency based ranking) Top documents containing ω : (3 times), 2 (2 times)

55 WT application: Top-k document retrieval Other results: 2n + o(n) structure answers document frequency in constant time (Sadakane 27) Use RMQs for document listing (Sadakane 27) Prestore selected intervals (Hon et al. 29). Use skeleton Suffix Tree for mapping.

56 Outline The Inverted Index: a success story 2 Suffix sorted based indexes: a tour de force 3 More virtues of the Wavelet Tree 4 Various Structures 5 SDSL: A toolbox of succinct structures 6 Conclusion

57 Compressed Bitvectors: SD-array (Elias-Fano coding) Applications Graphs (debruijn subgraph, web graph,...) Lists (Inverted Lists,...) monotone sequence X with x x... x m n. Lower l = log n m bits stored explicitly. Upper bits as sequence H of unary encoded gaps Uses 2 + log n m bits per element Constant time access X by addding o(m) extra bits to H.

58 Compressed Bitvectors: SD-array (Elias-Fano coding) X = δ = H = L = 2 3 X[3]=rank (H, select (H, 4)) L[3] = = 7

59 Compressed Bitvectors: SD-array (Elias-Fano coding) Interpret X as positions of set bits in a bitvector b. Constant time acccess, rank, select on b. Well suited for sparse bitvectors. b = H = L = select(b, 4) = rank (H, select (H, 4)) L[3] = = 7 rank(b, i) and access(b, i): use select on H to select right upper part + scan left for entry in L

60 Balanced Parentheses Sequences (BPS) Represent BPS as bitvector. (()()(()())(()((()())()()))()((()())(()(()()))())) find_open(i) Find matching closing paren of BPS[i]. find_close(p) Find matching opening paren of BPS[i]. enclose(i) Find tightest paren pair which encloses pair (i,find_close(i)) and more (double_enclose(i,j), rank(i), select(i),... ) can be answered in constant time by just adding o(n) space. (Munro 996, Geary et al. 24 & 26)

61 Balanced Parentheses Sequences (BPS) Application: Succinct representation of trees 5 7 $ u dumulmum$ lmu m lmu u m lmu m$ $ ulmu m$ m$ m$ m$ 4 $ tree uncompressed O(n log n) bits BPS dfs = (()()(()())(()((()())()()))()((()())(()(()()))())) compressed 4n bits

62 Balanced Parentheses Sequences (BPS) Application: Succinct representation of trees 5 7 $ u dumulmum$ lmu m lmu u m lmu m$ $ ulmu m$ m$ m$ m$ 4 $ tree uncompressed O(n log n) bits BPS dfs = (()()(()())(()((()())()()))()((()())(()(()()))())) compressed 4n bits

63 Balanced Parentheses Sequences (BPS) Application: Succinct representation of trees 5 7 $ u dumulmum$ lmu m lmu u m lmu m$ $ ulmu m$ m$ m$ m$ 4 $ tree uncompressed O(n log n) bits BPS dfs = (()()(()())(()((()())()()))()((()())(()(()()))())) compressed 4n bits

64 Balanced Parentheses Sequences (BPS) Application: Succinct representation of trees 5 7 $ u dumulmum$ lmu m lmu u m lmu m$ $ ulmu m$ m$ m$ m$ 4 $ tree uncompressed O(n log n) bits BPS dfs = (()()(()())(()((()())()()))()((()())(()(()()))())) compressed 4n bits

65 Balanced Parentheses Sequences (BPS) Application: Succinct representation of trees 5 37 $ u dumulmum$ lmu m lmu u m lmu m$ $ ulmu m$ m$ m$ m$ 4 $ tree uncompressed O(n log n) bits BPS dfs = (()()(()())(()((()())()()))()((()())(()(()()))())) compressed 4n bits

66 Balanced Parentheses Sequences (BPS) Application: Succinct representation of trees 5 37 $ u dumulmum$ lmu m lmu u m lmu m$ $ ulmu m$ m$ m$ m$ 4 $ tree uncompressed O(n log n) bits BPS dfs = (()()(()())(()((()())()()))()((()())(()(()()))())) compressed 4n bits

67 Balanced Parentheses Sequences (BPS) Application: Succinct representation of trees $ u 5 dumulmum$ lmu m 37 m$ 4 $ lmum 5 u lmu m$ $ ulmu m$ m$ m$ tree uncompressed O(n log n) bits BPS dfs = (()()(()())(()((()())()()))()((()())(()(()()))())) compressed 4n bits

68 Balanced Parentheses Sequences (BPS) Application: Succinct representation of trees $ u 5 dumulmum$ lmu m 37 m$ 4 $ lmum 5 u lmu m$ 6 $ ulmu m$ m$ m$ tree uncompressed O(n log n) bits BPS dfs = (()()(()())(()((()())()()))()((()())(()(()()))())) compressed 4n bits

69 Balanced Parentheses Sequences (BPS) Application: Succinct representation of trees $ u 5 dumulmum$ lmu m 37 m$ 4 $ lmum 5 u lmu m$ 6 $ ulmu m$ m$ m$ tree uncompressed O(n log n) bits BPS dfs = (()()(()())(()((()())()()))()((()())(()(()()))())) compressed 4n bits

70 Balanced Parentheses Sequences (BPS) Application: Succinct representation of trees $ u 5 dumulmum$ lmu m 37 m$ 4 $ lmum 5 u lmu m$ 6 $ ulmu m$ m$ m$ tree uncompressed O(n log n) bits BPS dfs = (()()(()())(()((()())()()))()((()())(()(()()))())) compressed 4n bits

71 Balanced Parentheses Sequences (BPS) Application: Succinct representation of trees $ u 5 dumulmum$ lmu m 37 m$ 4 $ lmum 5 u lmu m$ 6 $ ulmu m$ m$ m$ tree uncompressed O(n log n) bits BPS dfs = (()()(()())(()((()())()()))()((()())(()(()()))())) compressed 4n bits

72 Balanced Parentheses Sequences (BPS) Application: Succinct representation of trees $ u 5 dumulmum$ lmu m m$ 4 $ lmum 5 u lmu m$ 36 6 $ ulmu 5 m$ m$ m$ tree uncompressed O(n log n) bits BPS dfs = (()()(()())(()((()())()()))()((()())(()(()()))())) compressed 4n bits

73 Balanced Parentheses Sequences (BPS) Application: Range Minimum Queries Array X of n elements from a well-ordered set. rmq(i,j) returns position of minimum element in X[i, j]. 2n + o(n) bits and O() time solution (Fischer & Heun 2) X = rmq(2, 6)=3

74 Compressed Suffix Trees (CST) $ Full functionality root() 4 is leaf (v) parent(v) degree(v) child(v, c) 3 2 select child(v, i) 4 2 sibling(v) depth(v) lca(v, w) sl(v), wl(v, c) (()()(()())(()((()())()()))()((()())(()(()()))())) id(v), inv_id(i) Size: CSA + LCP + BPS, i.e. about original text size dumulmum$ 9m$ m$ $ $ lmu lmu u m m$ u m$ lmu m 8m$ ulmu

75 Compressed Suffix Trees (CST) Construction Resource diagram for the construction of a CST for 2MB of English text.

76 Outline The Inverted Index: a success story 2 Suffix sorted based indexes: a tour de force 3 More virtues of the Wavelet Tree 4 Various Structures 5 SDSL: A toolbox of succinct structures 6 Conclusion

77 The Succinct Data Structure Library SDSL Input from about 4 research papers Fast and space-efficient construction Index structures available for byte and integer alphabets Well-optimized (e.g. hardware POPCOUNT, hugepages,... ) Easy configuration of myriads of CSAs, CSTs, WTs Fast prototyping Tests, benchmarks, tutorial, cheat sheet

78 Outline The Inverted Index: a success story 2 Suffix sorted based indexes: a tour de force 3 More virtues of the Wavelet Tree 4 Various Structures 5 SDSL: A toolbox of succinct structures 6 Conclusion

79 Conclusion Limitations/Challenges of Succinct Structures: Often only work for a static setting Memory access pattern often non-local Implementation from scratch difficult Construction is often the bottleneck Algorithms have to be adapted to work fast with succinct structures

80 Conclusion Advantages of Succinct Structures: Allow indexation of larger data sets (about - larger compared to classical indexes). Adapt to the data: E.g. capture repetitions. Often provide more functionality. More functionality often produces conceptional easier solutions. Not limited to text indexing.

81 Conclusion Many applications heavily depend on Indexes. Self-Indexes made it possible to process Next Generation Sequencing in Genomics and Life Sciences. Why? Only the speed of self indexes matches the amazing throughput of sequencing technologies.

82 Thank you for your attention. Thank MASTODONS

Succinct Data Structures for Data Mining

Succinct Data Structures for Data Mining Succinct Data Structures for Data Mining Rajeev Raman University of Leicester ALSIP 2014, Tainan Overview Introduction Compressed Data Structuring Data Structures Applications Libraries End Big Data vs.

More information

Design of Practical Succinct Data Structures for Large Data Collections

Design of Practical Succinct Data Structures for Large Data Collections Design of Practical Succinct Data Structures for Large Data Collections Roberto Grossi and Giuseppe Ottaviano Dipartimento di Informatica Università di Pisa {grossi,ottaviano}@di.unipi.it Abstract. We

More information

Compressed Text Indexes with Fast Locate

Compressed Text Indexes with Fast Locate Compressed Text Indexes with Fast Locate Rodrigo González and Gonzalo Navarro Dept. of Computer Science, University of Chile. {rgonzale,gnavarro}@dcc.uchile.cl Abstract. Compressed text (self-)indexes

More information

Image Compression through DCT and Huffman Coding Technique

Image Compression through DCT and Huffman Coding Technique International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Rahul

More information

Position-Restricted Substring Searching

Position-Restricted Substring Searching Position-Restricted Substring Searching Veli Mäkinen 1 and Gonzalo Navarro 2 1 AG Genominformatik, Technische Fakultät Universität Bielefeld, Germany. veli@cebitec.uni-bielefeld.de 2 Center for Web Research

More information

Compression techniques

Compression techniques Compression techniques David Bařina February 22, 2013 David Bařina Compression techniques February 22, 2013 1 / 37 Contents 1 Terminology 2 Simple techniques 3 Entropy coding 4 Dictionary methods 5 Conclusion

More information

PRACTICAL IMPLEMENTATION OF RANK AND SELECT QUERIES

PRACTICAL IMPLEMENTATION OF RANK AND SELECT QUERIES PRACTICAL IMPLEMENTATION OF RANK AND SELECT QUERIES Rodrigo González Dept. of Computer Science, University of Chile, Chile rgonzale@dcc.uchile.cl Szymon Grabowski Computer Engineering Department, Technical

More information

Lempel-Ziv Coding Adaptive Dictionary Compression Algorithm

Lempel-Ziv Coding Adaptive Dictionary Compression Algorithm Lempel-Ziv Coding Adaptive Dictionary Compression Algorithm 1. LZ77:Sliding Window Lempel-Ziv Algorithm [gzip, pkzip] Encode a string by finding the longest match anywhere within a window of past symbols

More information

Searching BWT compressed text with the Boyer-Moore algorithm and binary search

Searching BWT compressed text with the Boyer-Moore algorithm and binary search Searching BWT compressed text with the Boyer-Moore algorithm and binary search Tim Bell 1 Matt Powell 1 Amar Mukherjee 2 Don Adjeroh 3 November 2001 Abstract: This paper explores two techniques for on-line

More information

Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research

Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research Algorithmic Techniques for Big Data Analysis Barna Saha AT&T Lab-Research Challenges of Big Data VOLUME Large amount of data VELOCITY Needs to be analyzed quickly VARIETY Different types of structured

More information

Information, Entropy, and Coding

Information, Entropy, and Coding Chapter 8 Information, Entropy, and Coding 8. The Need for Data Compression To motivate the material in this chapter, we first consider various data sources and some estimates for the amount of data associated

More information

LZ77. Example 2.10: Let T = badadadabaab and assume d max and l max are large. phrase b a d adadab aa b

LZ77. Example 2.10: Let T = badadadabaab and assume d max and l max are large. phrase b a d adadab aa b LZ77 The original LZ77 algorithm works as follows: A phrase T j starting at a position i is encoded as a triple of the form distance, length, symbol. A triple d, l, s means that: T j = T [i...i + l] =

More information

SMALL INDEX LARGE INDEX (SILT)

SMALL INDEX LARGE INDEX (SILT) Wayne State University ECE 7650: Scalable and Secure Internet Services and Architecture SMALL INDEX LARGE INDEX (SILT) A Memory Efficient High Performance Key Value Store QA REPORT Instructor: Dr. Song

More information

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

More information

Original-page small file oriented EXT3 file storage system

Original-page small file oriented EXT3 file storage system Original-page small file oriented EXT3 file storage system Zhang Weizhe, Hui He, Zhang Qizhen School of Computer Science and Technology, Harbin Institute of Technology, Harbin E-mail: wzzhang@hit.edu.cn

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Arithmetic Coding: Introduction

Arithmetic Coding: Introduction Data Compression Arithmetic coding Arithmetic Coding: Introduction Allows using fractional parts of bits!! Used in PPM, JPEG/MPEG (as option), Bzip More time costly than Huffman, but integer implementation

More information

ESTRUCTURAS DE DATOS SUCINTAS PARA RECUPERACIÓN DE DOCUMENTOS

ESTRUCTURAS DE DATOS SUCINTAS PARA RECUPERACIÓN DE DOCUMENTOS UNIVERSIDAD DE CHILE FACULTAD DE CIENCIAS FÍSICAS Y MATEMÁTICAS DEPARTAMENTO DE CIENCIAS DE LA COMPUTACIÓN ESTRUCTURAS DE DATOS SUCINTAS PARA RECUPERACIÓN DE DOCUMENTOS TESIS PARA OPTAR AL TÍTULO DE GRADO

More information

Didacticiel Études de cas. Association Rules mining with Tanagra, R (arules package), Orange, RapidMiner, Knime and Weka.

Didacticiel Études de cas. Association Rules mining with Tanagra, R (arules package), Orange, RapidMiner, Knime and Weka. 1 Subject Association Rules mining with Tanagra, R (arules package), Orange, RapidMiner, Knime and Weka. This document extends a previous tutorial dedicated to the comparison of various implementations

More information

The following themes form the major topics of this chapter: The terms and concepts related to trees (Section 5.2).

The following themes form the major topics of this chapter: The terms and concepts related to trees (Section 5.2). CHAPTER 5 The Tree Data Model There are many situations in which information has a hierarchical or nested structure like that found in family trees or organization charts. The abstraction that models hierarchical

More information

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation

More information

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation

More information

DNA Mapping/Alignment. Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky

DNA Mapping/Alignment. Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky DNA Mapping/Alignment Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky Overview Summary Research Paper 1 Research Paper 2 Research Paper 3 Current Progress Software Designs to

More information

DNA Sequencing Data Compression. Michael Chung

DNA Sequencing Data Compression. Michael Chung DNA Sequencing Data Compression Michael Chung Problem DNA sequencing per dollar is increasing faster than storage capacity per dollar. Stein (2010) Data 3 billion base pairs in human genome Genomes are

More information

ANALYSIS OF THE EFFECTIVENESS IN IMAGE COMPRESSION FOR CLOUD STORAGE FOR VARIOUS IMAGE FORMATS

ANALYSIS OF THE EFFECTIVENESS IN IMAGE COMPRESSION FOR CLOUD STORAGE FOR VARIOUS IMAGE FORMATS ANALYSIS OF THE EFFECTIVENESS IN IMAGE COMPRESSION FOR CLOUD STORAGE FOR VARIOUS IMAGE FORMATS Dasaradha Ramaiah K. 1 and T. Venugopal 2 1 IT Department, BVRIT, Hyderabad, India 2 CSE Department, JNTUH,

More information

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02) Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we

More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

5. Binary objects labeling

5. Binary objects labeling Image Processing - Laboratory 5: Binary objects labeling 1 5. Binary objects labeling 5.1. Introduction In this laboratory an object labeling algorithm which allows you to label distinct objects from a

More information

Search and Information Retrieval

Search and Information Retrieval Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

More information

Storage Management for Files of Dynamic Records

Storage Management for Files of Dynamic Records Storage Management for Files of Dynamic Records Justin Zobel Department of Computer Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia. jz@cs.rmit.edu.au Alistair Moffat Department of Computer Science

More information

Analysis of Compression Algorithms for Program Data

Analysis of Compression Algorithms for Program Data Analysis of Compression Algorithms for Program Data Matthew Simpson, Clemson University with Dr. Rajeev Barua and Surupa Biswas, University of Maryland 12 August 3 Abstract Insufficient available memory

More information

Physical Data Organization

Physical Data Organization Physical Data Organization Database design using logical model of the database - appropriate level for users to focus on - user independence from implementation details Performance - other major factor

More information

General Incremental Sliding-Window Aggregation

General Incremental Sliding-Window Aggregation General Incremental Sliding-Window Aggregation Kanat Tangwongsan Mahidol University International College, Thailand (part of this work done at IBM Research, Watson) Joint work with Martin Hirzel, Scott

More information

MS SQL Performance (Tuning) Best Practices:

MS SQL Performance (Tuning) Best Practices: MS SQL Performance (Tuning) Best Practices: 1. Don t share the SQL server hardware with other services If other workloads are running on the same server where SQL Server is running, memory and other hardware

More information

Binary Trees and Huffman Encoding Binary Search Trees

Binary Trees and Huffman Encoding Binary Search Trees Binary Trees and Huffman Encoding Binary Search Trees Computer Science E119 Harvard Extension School Fall 2012 David G. Sullivan, Ph.D. Motivation: Maintaining a Sorted Collection of Data A data dictionary

More information

www.gr8ambitionz.com

www.gr8ambitionz.com Data Base Management Systems (DBMS) Study Material (Objective Type questions with Answers) Shared by Akhil Arora Powered by www. your A to Z competitive exam guide Database Objective type questions Q.1

More information

Wan Accelerators: Optimizing Network Traffic with Compression. Bartosz Agas, Marvin Germar & Christopher Tran

Wan Accelerators: Optimizing Network Traffic with Compression. Bartosz Agas, Marvin Germar & Christopher Tran Wan Accelerators: Optimizing Network Traffic with Compression Bartosz Agas, Marvin Germar & Christopher Tran Introduction A WAN accelerator is an appliance that can maximize the services of a point-to-point(ptp)

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Parquet. Columnar storage for the people

Parquet. Columnar storage for the people Parquet Columnar storage for the people Julien Le Dem @J_ Processing tools lead, analytics infrastructure at Twitter Nong Li nong@cloudera.com Software engineer, Cloudera Impala Outline Context from various

More information

4.2 Sorting and Searching

4.2 Sorting and Searching Sequential Search: Java Implementation 4.2 Sorting and Searching Scan through array, looking for key. search hit: return array index search miss: return -1 public static int search(string key, String[]

More information

A parallel algorithm for the extraction of structured motifs

A parallel algorithm for the extraction of structured motifs parallel algorithm for the extraction of structured motifs lexandra arvalho MEI 2002/03 omputação em Sistemas Distribuídos 2003 p.1/27 Plan of the talk Biological model of regulation Nucleic acids: DN

More information

Chapter 2 The Information Retrieval Process

Chapter 2 The Information Retrieval Process Chapter 2 The Information Retrieval Process Abstract What does an information retrieval system look like from a bird s eye perspective? How can a set of documents be processed by a system to make sense

More information

SAP HANA In-Memory Database Sizing Guideline

SAP HANA In-Memory Database Sizing Guideline SAP HANA In-Memory Database Sizing Guideline Version 1.4 August 2013 2 DISCLAIMER Sizing recommendations apply for certified hardware only. Please contact hardware vendor for suitable hardware configuration.

More information

Lossless Data Compression Standard Applications and the MapReduce Web Computing Framework

Lossless Data Compression Standard Applications and the MapReduce Web Computing Framework Lossless Data Compression Standard Applications and the MapReduce Web Computing Framework Sergio De Agostino Computer Science Department Sapienza University of Rome Internet as a Distributed System Modern

More information

Indexing and Compression of Text

Indexing and Compression of Text Compressing the Digital Library Timothy C. Bell 1, Alistair Moffat 2, and Ian H. Witten 3 1 Department of Computer Science, University of Canterbury, New Zealand, tim@cosc.canterbury.ac.nz 2 Department

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

How to use IBM HeapAnalyzer to diagnose Java heap issues

How to use IBM HeapAnalyzer to diagnose Java heap issues IBM Software Group How to use IBM HeapAnalyzer to diagnose Java heap issues Jinwoo Hwang (jinwoo@us.ibm.com) IBM HeapAnalyzer Architect/Developer WebSphere Support Technical Exchange Introduction Java

More information

Hybrid Lossless Compression Method For Binary Images

Hybrid Lossless Compression Method For Binary Images M.F. TALU AND İ. TÜRKOĞLU/ IU-JEEE Vol. 11(2), (2011), 1399-1405 Hybrid Lossless Compression Method For Binary Images M. Fatih TALU, İbrahim TÜRKOĞLU Inonu University, Dept. of Computer Engineering, Engineering

More information

CHAPTER 17: File Management

CHAPTER 17: File Management CHAPTER 17: File Management The Architecture of Computer Hardware, Systems Software & Networking: An Information Technology Approach 4th Edition, Irv Englander John Wiley and Sons 2010 PowerPoint slides

More information

Raima Database Manager Version 14.0 In-memory Database Engine

Raima Database Manager Version 14.0 In-memory Database Engine + Raima Database Manager Version 14.0 In-memory Database Engine By Jeffrey R. Parsons, Senior Engineer January 2016 Abstract Raima Database Manager (RDM) v14.0 contains an all new data storage engine optimized

More information

Structure for String Keys

Structure for String Keys Burst Tries: A Fast, Efficient Data Structure for String Keys Steen Heinz Justin Zobel Hugh E. Williams School of Computer Science and Information Technology, RMIT University Presented by Margot Schips

More information

Vector storage and access; algorithms in GIS. This is lecture 6

Vector storage and access; algorithms in GIS. This is lecture 6 Vector storage and access; algorithms in GIS This is lecture 6 Vector data storage and access Vectors are built from points, line and areas. (x,y) Surface: (x,y,z) Vector data access Access to vector

More information

Approximate String Matching in DNA Sequences

Approximate String Matching in DNA Sequences Approximate String Matching in DNA Sequences Lok-Lam Cheng David W. Cheung Siu-Ming Yiu Department of Computer Science and Infomation Systems, The University of Hong Kong, Pokflum Road, Hong Kong {llcheng,dcheung,smyiu}@csis.hku.hk

More information

Efficient Processing of Joins on Set-valued Attributes

Efficient Processing of Joins on Set-valued Attributes Efficient Processing of Joins on Set-valued Attributes Nikos Mamoulis Department of Computer Science and Information Systems University of Hong Kong Pokfulam Road Hong Kong nikos@csis.hku.hk Abstract Object-oriented

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Introducing the Microsoft IIS deployment guide

Introducing the Microsoft IIS deployment guide Deployment Guide Deploying Microsoft Internet Information Services with the BIG-IP System Introducing the Microsoft IIS deployment guide F5 s BIG-IP system can increase the existing benefits of deploying

More information

GCE Computing. COMP3 Problem Solving, Programming, Operating Systems, Databases and Networking Report on the Examination.

GCE Computing. COMP3 Problem Solving, Programming, Operating Systems, Databases and Networking Report on the Examination. GCE Computing COMP3 Problem Solving, Programming, Operating Systems, Databases and Networking Report on the Examination 2510 Summer 2014 Version: 1.0 Further copies of this Report are available from aqa.org.uk

More information

Binary Coded Web Access Pattern Tree in Education Domain

Binary Coded Web Access Pattern Tree in Education Domain Binary Coded Web Access Pattern Tree in Education Domain C. Gomathi P.G. Department of Computer Science Kongu Arts and Science College Erode-638-107, Tamil Nadu, India E-mail: kc.gomathi@gmail.com M. Moorthi

More information

Fast Arithmetic Coding (FastAC) Implementations

Fast Arithmetic Coding (FastAC) Implementations Fast Arithmetic Coding (FastAC) Implementations Amir Said 1 Introduction This document describes our fast implementations of arithmetic coding, which achieve optimal compression and higher throughput by

More information

Topological Data Analysis Applications to Computer Vision

Topological Data Analysis Applications to Computer Vision Topological Data Analysis Applications to Computer Vision Vitaliy Kurlin, http://kurlin.org Microsoft Research Cambridge and Durham University, UK Topological Data Analysis quantifies topological structures

More information

Binary Heap Algorithms

Binary Heap Algorithms CS Data Structures and Algorithms Lecture Slides Wednesday, April 5, 2009 Glenn G. Chappell Department of Computer Science University of Alaska Fairbanks CHAPPELLG@member.ams.org 2005 2009 Glenn G. Chappell

More information

In-Memory Data Management for Enterprise Applications

In-Memory Data Management for Enterprise Applications In-Memory Data Management for Enterprise Applications Jens Krueger Senior Researcher and Chair Representative Research Group of Prof. Hasso Plattner Hasso Plattner Institute for Software Engineering University

More information

Files. Files. Files. Files. Files. File Organisation. What s it all about? What s in a file?

Files. Files. Files. Files. Files. File Organisation. What s it all about? What s in a file? Files What s it all about? Information being stored about anything important to the business/individual keeping the files. The simple concepts used in the operation of manual files are often a good guide

More information

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 13-1

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 13-1 Slide 13-1 Chapter 13 Disk Storage, Basic File Structures, and Hashing Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files Hashed Files Dynamic and Extendible

More information

Hypertable Architecture Overview

Hypertable Architecture Overview WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for

More information

Optimizing the Performance of Your Longview Application

Optimizing the Performance of Your Longview Application Optimizing the Performance of Your Longview Application François Lalonde, Director Application Support May 15, 2013 Disclaimer This presentation is provided to you solely for information purposes, is not

More information

Bitmap Index as Effective Indexing for Low Cardinality Column in Data Warehouse

Bitmap Index as Effective Indexing for Low Cardinality Column in Data Warehouse Bitmap Index as Effective Indexing for Low Cardinality Column in Data Warehouse Zainab Qays Abdulhadi* * Ministry of Higher Education & Scientific Research Baghdad, Iraq Zhang Zuping Hamed Ibrahim Housien**

More information

Package PKI. July 28, 2015

Package PKI. July 28, 2015 Version 0.1-3 Package PKI July 28, 2015 Title Public Key Infrastucture for R Based on the X.509 Standard Author Maintainer Depends R (>= 2.9.0),

More information

How To Trace

How To Trace CS510 Software Engineering Dynamic Program Analysis Asst. Prof. Mathias Payer Department of Computer Science Purdue University TA: Scott A. Carr Slides inspired by Xiangyu Zhang http://nebelwelt.net/teaching/15-cs510-se

More information

Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems

Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems Ingo Müller #, Cornelius Ratsch #, Franz Faerber # ingo.mueller@kit.edu, cornelius.ratsch@sap.com, franz.faerber@sap.com

More information

Key Components of WAN Optimization Controller Functionality

Key Components of WAN Optimization Controller Functionality Key Components of WAN Optimization Controller Functionality Introduction and Goals One of the key challenges facing IT organizations relative to application and service delivery is ensuring that the applications

More information

Chapter 13 Disk Storage, Basic File Structures, and Hashing.

Chapter 13 Disk Storage, Basic File Structures, and Hashing. Chapter 13 Disk Storage, Basic File Structures, and Hashing. Copyright 2004 Pearson Education, Inc. Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files

More information

zdelta: An Efficient Delta Compression Tool

zdelta: An Efficient Delta Compression Tool zdelta: An Efficient Delta Compression Tool Dimitre Trendafilov Nasir Memon Torsten Suel Department of Computer and Information Science Technical Report TR-CIS-2002-02 6/26/2002 zdelta: An Efficient Delta

More information

Modeling Service Oriented Architectures In A Command and Control Context. Jim Solderitsch arces-dev@gestalt-llc.com

Modeling Service Oriented Architectures In A Command and Control Context. Jim Solderitsch arces-dev@gestalt-llc.com Modeling Service Oriented Architectures In A Command and Control Context Jim Solderitsch arces-dev@gestalt-llc.com Outline Project Overview Technology Background Service Oriented Architecture (SOA) Enterprise

More information

Comparison of different image compression formats. ECE 533 Project Report Paula Aguilera

Comparison of different image compression formats. ECE 533 Project Report Paula Aguilera Comparison of different image compression formats ECE 533 Project Report Paula Aguilera Introduction: Images are very important documents nowadays; to work with them in some applications they need to be

More information

Chapter 13. Disk Storage, Basic File Structures, and Hashing

Chapter 13. Disk Storage, Basic File Structures, and Hashing Chapter 13 Disk Storage, Basic File Structures, and Hashing Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files Hashed Files Dynamic and Extendible Hashing

More information

Suffix Tree Construction and Storage with Limited Main Memory

Suffix Tree Construction and Storage with Limited Main Memory Universität Bielefeld Technische Fakultät Abteilung Informationstechnik Forschungsberichte Suffix Tree Construction and Storage with Limited Main Memory Klaus-Bernd Schürmann Jens Stoye Report 2003-06

More information

CS 2112 Spring 2014. 0 Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

CS 2112 Spring 2014. 0 Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions CS 2112 Spring 2014 Assignment 3 Data Structures and Web Filtering Due: March 4, 2014 11:59 PM Implementing spam blacklists and web filters requires matching candidate domain names and URLs very rapidly

More information

Development and Evaluation of Point Cloud Compression for the Point Cloud Library

Development and Evaluation of Point Cloud Compression for the Point Cloud Library Development and Evaluation of Point Cloud Compression for the Institute for Media Technology, TUM, Germany May 12, 2011 Motivation Point Cloud Stream Compression Network Point Cloud Stream Decompression

More information

Storage Optimization in Cloud Environment using Compression Algorithm

Storage Optimization in Cloud Environment using Compression Algorithm Storage Optimization in Cloud Environment using Compression Algorithm K.Govinda 1, Yuvaraj Kumar 2 1 School of Computing Science and Engineering, VIT University, Vellore, India kgovinda@vit.ac.in 2 School

More information

Merkle Hash Trees for Distributed Audit Logs

Merkle Hash Trees for Distributed Audit Logs Merkle Hash Trees for Distributed Audit Logs Subject proposed by Karthikeyan Bhargavan Karthikeyan.Bhargavan@inria.fr April 7, 2015 Modern distributed systems spread their databases across a large number

More information

A Partition-Based Efficient Algorithm for Large Scale. Multiple-Strings Matching

A Partition-Based Efficient Algorithm for Large Scale. Multiple-Strings Matching A Partition-Based Efficient Algorithm for Large Scale Multiple-Strings Matching Ping Liu Jianlong Tan, Yanbing Liu Software Division, Institute of Computing Technology, Chinese Academy of Sciences, Beijing,

More information

Compression algorithm for Bayesian network modeling of binary systems

Compression algorithm for Bayesian network modeling of binary systems Compression algorithm for Bayesian network modeling of binary systems I. Tien & A. Der Kiureghian University of California, Berkeley ABSTRACT: A Bayesian network (BN) is a useful tool for analyzing the

More information

Integrating VoltDB with Hadoop

Integrating VoltDB with Hadoop The NewSQL database you ll never outgrow Integrating with Hadoop Hadoop is an open source framework for managing and manipulating massive volumes of data. is an database for handling high velocity data.

More information

Maximizing Hadoop Performance with Hardware Compression

Maximizing Hadoop Performance with Hardware Compression Maximizing Hadoop Performance with Hardware Compression Robert Reiner Director of Marketing Compression and Security Exar Corporation November 2012 1 What is Big? sets whose size is beyond the ability

More information

Chapter 2: Basics on computers and digital information coding. A.A. 2012-2013 Information Technology and Arts Organizations

Chapter 2: Basics on computers and digital information coding. A.A. 2012-2013 Information Technology and Arts Organizations Chapter 2: Basics on computers and digital information coding Information Technology and Arts Organizations 1 Syllabus (1/3) 1. Introduction on Information Technologies (IT) and Cultural Heritage (CH)

More information

Lecture 4: Exact string searching algorithms. Exact string search algorithms. Definitions. Exact string searching or matching

Lecture 4: Exact string searching algorithms. Exact string search algorithms. Definitions. Exact string searching or matching COSC 348: Computing for Bioinformatics Definitions A pattern (keyword) is an ordered sequence of symbols. Lecture 4: Exact string searching algorithms Lubica Benuskova http://www.cs.otago.ac.nz/cosc348/

More information

Linux+ Guide to Linux Certification, Third Edition. Chapter 11 Compression, System Backup, and Software Installation

Linux+ Guide to Linux Certification, Third Edition. Chapter 11 Compression, System Backup, and Software Installation Linux+ Guide to Linux Certification, Third Edition Chapter 11 Compression, System Backup, and Software Installation Objectives Outline the features of common compression utilities Compress and decompress

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches Column-Stores Horizontal/Vertical Partitioning Horizontal Partitions Master Table Vertical Partitions Primary Key 3 Motivation

More information

Data Warehousing und Data Mining

Data Warehousing und Data Mining Data Warehousing und Data Mining Multidimensionale Indexstrukturen Ulf Leser Wissensmanagement in der Bioinformatik Content of this Lecture Multidimensional Indexing Grid-Files Kd-trees Ulf Leser: Data

More information

Chapter 13. Chapter Outline. Disk Storage, Basic File Structures, and Hashing

Chapter 13. Chapter Outline. Disk Storage, Basic File Structures, and Hashing Chapter 13 Disk Storage, Basic File Structures, and Hashing Copyright 2007 Ramez Elmasri and Shamkant B. Navathe Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files

More information

NCPC 2013 Presentation of solutions

NCPC 2013 Presentation of solutions NCPC 2013 Presentation of solutions Head of Jury: Lukáš Poláček 2013-10-05 NCPC Jury authors Andreas Björklund (LTH) Jaap Eldering (Imperial) Daniel Espling (UMU) Christian Ledig (Imperial) Ulf Lundström

More information

Why? A central concept in Computer Science. Algorithms are ubiquitous.

Why? A central concept in Computer Science. Algorithms are ubiquitous. Analysis of Algorithms: A Brief Introduction Why? A central concept in Computer Science. Algorithms are ubiquitous. Using the Internet (sending email, transferring files, use of search engines, online

More information

Quiz! Database Indexes. Index. Quiz! Disc and main memory. Quiz! How costly is this operation (naive solution)?

Quiz! Database Indexes. Index. Quiz! Disc and main memory. Quiz! How costly is this operation (naive solution)? Database Indexes How costly is this operation (naive solution)? course per weekday hour room TDA356 2 VR Monday 13:15 TDA356 2 VR Thursday 08:00 TDA356 4 HB1 Tuesday 08:00 TDA356 4 HB1 Friday 13:15 TIN090

More information

A Time Efficient Algorithm for Web Log Analysis

A Time Efficient Algorithm for Web Log Analysis A Time Efficient Algorithm for Web Log Analysis Santosh Shakya Anju Singh Divakar Singh Student [M.Tech.6 th sem (CSE)] Asst.Proff, Dept. of CSE BU HOD (CSE), BUIT, BUIT,BU Bhopal Barkatullah University,

More information

Analysis of Algorithms I: Optimal Binary Search Trees

Analysis of Algorithms I: Optimal Binary Search Trees Analysis of Algorithms I: Optimal Binary Search Trees Xi Chen Columbia University Given a set of n keys K = {k 1,..., k n } in sorted order: k 1 < k 2 < < k n we wish to build an optimal binary search

More information

Computer Aided Translation

Computer Aided Translation Computer Aided Translation Philipp Koehn 30 April 2015 Why Machine Translation? 1 Assimilation reader initiates translation, wants to know content user is tolerant of inferior quality focus of majority

More information

4 5 6 7 8 9 10 11 What is a character acte set? Definition Usage A character encoding or character set (sometimes referred to as code page) consists of a code that pairs a sequence of characters from a

More information

The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge,

The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge, The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge, cheapest first, we had to determine whether its two endpoints

More information

Scalable Prefix Matching for Internet Packet Forwarding

Scalable Prefix Matching for Internet Packet Forwarding Scalable Prefix Matching for Internet Packet Forwarding Marcel Waldvogel Computer Engineering and Networks Laboratory Institut für Technische Informatik und Kommunikationsnetze Background Internet growth

More information