May 10, 1994

SRC Research Report 124

A Block-sorting Lossless Data Compression Algorithm

M. Burrows and D.J. Wheeler

Digital Systems Research Center
130 Lytton Avenue
Palo Alto, California 94301
Systems Research Center

The charter of SRC is to advance both the state of knowledge and the state of the art in computer systems. From our establishment in 1984, we have performed basic and applied research to support Digital's business objectives. Our current work includes exploring distributed personal computing on multiple platforms, networking, programming technology, system modelling and management techniques, and selected applications.

Our strategy is to test the technical and practical value of our ideas by building hardware and software prototypes and using them as daily tools. Interesting systems are too complex to be evaluated solely in the abstract; extended use allows us to investigate their properties in depth. This experience is useful in the short term in refining our designs, and invaluable in the long term in advancing our knowledge. Most of the major advances in information systems have come through this strategy, including personal computing, distributed systems, and the Internet.

We also perform complementary work of a more mathematical flavor. Some of it is in established fields of theoretical computer science, such as the analysis of algorithms, computational geometry, and logics of programming. Other work explores new ground motivated by problems that arise in our systems research.

We have a strong commitment to communicating our results; exposing and testing our ideas in the research and development communities leads to improved understanding. Our research report series supplements publication in professional journals and conferences. We seek users for our prototype systems among those with whom we have common interests, and we encourage collaboration with university researchers.

Robert W. Taylor, Director
A Block-sorting Lossless Data Compression Algorithm

M. Burrows and D.J. Wheeler

May 10, 1994
David Wheeler is a Professor of Computer Science at the University of Cambridge, U.K. His electronic mail address is: [email protected]

© Digital Equipment Corporation

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of the Systems Research Center of Digital Equipment Corporation in Palo Alto, California; an acknowledgment of the authors and individual contributors to the work; and all applicable portions of the copyright notice. Copying, reproducing, or republishing for any other purpose shall require a license with payment of fee to the Systems Research Center. All rights reserved.
Authors' abstract

We describe a block-sorting, lossless data compression algorithm, and our implementation of that algorithm. We compare the performance of our implementation with widely available data compressors running on the same hardware.

The algorithm works by applying a reversible transformation to a block of input text. The transformation does not itself compress the data, but reorders it to make it easy to compress with simple algorithms such as move-to-front coding. Our algorithm achieves speed comparable to algorithms based on the techniques of Lempel and Ziv, but obtains compression close to the best statistical modelling techniques. The size of the input block must be large (a few kilobytes) to achieve good compression.
Contents

1 Introduction
2 The reversible transformation
3 Why the transformed string compresses well
4 An efficient implementation
  4.1 Compression
  4.2 Decompression
5 Algorithm variants
6 Performance of implementation
7 Conclusions
1 Introduction

The most widely used data compression algorithms are based on the sequential data compressors of Lempel and Ziv [1, 2]. Statistical modelling techniques may produce superior compression [3], but are significantly slower. In this paper, we present a technique that achieves compression within a percent or so of that achieved by statistical modelling techniques, but at speeds comparable to those of algorithms based on Lempel and Ziv's.

Our algorithm does not process its input sequentially, but instead processes a block of text as a single unit. The idea is to apply a reversible transformation to a block of text to form a new block that contains the same characters, but is easier to compress by simple compression algorithms. The transformation tends to group characters together so that the probability of finding a character close to another instance of the same character is increased substantially. Text of this kind can easily be compressed with fast locally-adaptive algorithms, such as move-to-front coding [4] in combination with Huffman or arithmetic coding.

Briefly, our algorithm transforms a string S of N characters by forming the N rotations (cyclic shifts) of S, sorting them lexicographically, and extracting the last character of each of the rotations. A string L is formed from these characters, where the ith character of L is the last character of the ith sorted rotation. In addition to L, the algorithm computes the index I of the original string S in the sorted list of rotations. Surprisingly, there is an efficient algorithm to compute the original string S given only L and I.

The sorting operation brings together rotations with the same initial characters. Since the initial characters of the rotations are adjacent to the final characters, consecutive characters in L are adjacent to similar strings in S. If the context of a character is a good predictor for the character, L will be easy to compress with a simple locally-adaptive compression algorithm.

In the following sections, we describe the transformation in more detail, and show that it can be inverted. We explain more carefully why this transformation tends to group characters to allow a simple compression algorithm to work more effectively. We then describe efficient techniques for implementing the transformation and its inverse, allowing this algorithm to be competitive in speed with Lempel-Ziv-based algorithms, but achieving better compression. Finally, we give the performance of our implementation of this algorithm, and compare it with well-known compression programmes.

The algorithm described here was discovered by one of the authors (Wheeler) in 1983 while he was working at AT&T Bell Laboratories, though it has not previously been published.
2 The reversible transformation

This section describes two sub-algorithms. Algorithm C performs the reversible transformation that we apply to a block of text before compressing it, and Algorithm D performs the inverse operation. A later section suggests a method for compressing the transformed block of text.

In the description below, we treat strings as vectors whose elements are characters.

Algorithm C: Compression transformation

This algorithm takes as input a string S of N characters S[0], ..., S[N-1] selected from an ordered alphabet X of characters. To illustrate the technique, we also give a running example, using the string S = abraca, N = 6, and the alphabet X = {'a', 'b', 'c', 'r'}.

C1. [sort rotations] Form a conceptual N × N matrix M whose elements are characters, and whose rows are the rotations (cyclic shifts) of S, sorted in lexicographical order. At least one of the rows of M contains the original string S. Let I be the index of the first such row, numbering from zero. In our example, the index I = 1 and the matrix M is

    row 0: aabrac
    row 1: abraca
    row 2: acaabr
    row 3: bracaa
    row 4: caabra
    row 5: racaab

C2. [find last characters of rotations] Let the string L be the last column of M, with characters L[0], ..., L[N-1] (equal to M[0, N-1], ..., M[N-1, N-1]). The output of the transformation is the pair (L, I). In our example, L = caraab and I = 1 (from step C1).
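For concreteness, Algorithm C can be rendered directly from this definition. The following Python sketch is an illustration only (it sorts the rotations naively; the efficient implementation we actually use is described in Section 4):

    def bwt_naive(s):
        # Form the N rotations of s, sort them lexicographically, and
        # take the last character of each sorted rotation (step C2).
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        last = ''.join(r[-1] for r in rotations)
        # Step C1: index of the first row holding the original string.
        index = rotations.index(s)
        return last, index

    # The running example: S = abraca gives L = caraab and I = 1.
    assert bwt_naive('abraca') == ('caraab', 1)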
Algorithm D: Decompression transformation

We use the example and notation introduced in Algorithm C. Algorithm D uses the output (L, I) of Algorithm C to reconstruct its input, the string S of length N.

D1. [find first characters of rotations] This step calculates the first column F of the matrix M of Algorithm C. This is done by sorting the characters of L to form F. We observe that any column of the matrix M is a permutation of the original string S. Thus, L and F are both permutations of S, and therefore of one another. Furthermore, because the rows of M are sorted, and F is the first column of M, the characters in F are also sorted. In our example, F = aaabcr.

D2. [build list of predecessor characters] To assist our explanation, we describe this step in terms of the contents of the matrix M. The reader should remember that the complete matrix is not available to the decompressor; only the strings F, L, and the index I (from the input) are needed by this step.

Consider the rows of the matrix M that start with some given character ch. Algorithm C ensured that the rows of matrix M are sorted lexicographically, so the rows that start with ch are ordered lexicographically. We define the matrix M' formed by rotating each row of M one character to the right, so for each i = 0, ..., N-1, and each j = 0, ..., N-1,

    M'[i, j] = M[i, (j-1) mod N]

In our example, M and M' are:

    row  M       M'
    0    aabrac  caabra
    1    abraca  aabrac
    2    acaabr  racaab
    3    bracaa  abraca
    4    caabra  acaabr
    5    racaab  bracaa

Like M, each row of M' is a rotation of S, and for each row of M there is a corresponding row in M'. We constructed M' from M so that the rows
of M' are sorted lexicographically starting with their second character. So, if we consider only those rows in M' that start with a character ch, they must appear in lexicographical order relative to one another; they have been sorted lexicographically starting with their second characters, and their first characters are all the same and so do not affect the sort order. Therefore, for any given character ch, the rows in M that begin with ch appear in the same order as the rows in M' that begin with ch.

In our example, this is demonstrated by the rows that begin with a. The rows aabrac, abraca, and acaabr are rows 0, 1, and 2 in M, and correspond to rows 1, 3, and 4 in M'.

Using F and L, the first columns of M and M' respectively, we calculate a vector T that indicates the correspondence between the rows of the two matrices, in the sense that for each j = 0, ..., N-1, row j of M' corresponds to row T[j] of M.

If L[j] is the kth instance of ch in L, then T[j] = i where F[i] is the kth instance of ch in F. Note that T represents a one-to-one correspondence between elements of F and elements of L, and F[T[j]] = L[j].

In our example, T is: (4 0 5 1 2 3).

D3. [form output S] Now, for each i = 0, ..., N-1, the characters L[i] and F[i] are the last and first characters of row i of M. Since each row is a rotation of S, the character L[i] cyclically precedes the character F[i] in S. From the construction of T, we have F[T[j]] = L[j]. Substituting i = T[j], we have L[T[j]] cyclically precedes L[j] in S.

The index I is defined by Algorithm C such that row I of M is S. Thus, the last character of S is L[I]. We use the vector T to give the predecessors of each character: for each i = 0, ..., N-1:

    S[N-1-i] = L[T^i[I]]

where T^0[x] = x, and T^(i+1)[x] = T[T^i[x]]. This yields S, the original input to the compressor. In our example, S = abraca.

We could have defined T so that the string S would be generated from front to back, rather than the other way around. The description above matches the pseudo-code given in Section 4.2.
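Algorithm D admits an equally direct rendering. The Python sketch below is for illustration (it builds T by counting character instances, matching the definition in step D2; the two-pass construction used in practice appears in Section 4.2):

    def inverse_bwt(last, index):
        # D1: F is L sorted.
        first = sorted(last)
        # D2: T[j] = i, where F[i] is the kth instance of ch in F
        # and L[j] is the kth instance of ch in L.
        start = {}                 # position in F of the first instance of each ch
        for i, ch in enumerate(first):
            start.setdefault(ch, i)
        seen = {}                  # instances of each ch seen so far in L
        t = []
        for ch in last:
            t.append(start[ch] + seen.get(ch, 0))
            seen[ch] = seen.get(ch, 0) + 1
        # D3: starting from row I, emit predecessors from back to front.
        n = len(last)
        out = [''] * n
        i = index
        for j in range(n - 1, -1, -1):
            out[j] = last[i]
            i = t[i]
        return ''.join(out)

    # Inverting the running example recovers the original string.
    assert inverse_bwt('caraab', 1) == 'abraca'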
The sequence T^i[I] for i = 0, ..., N-1 is not necessarily a permutation of the numbers 0, ..., N-1. If the original string S is of the form Z^p for some substring Z and some p > 1, then the sequence T^i[I] for i = 0, ..., N-1 will also be of the form Z'^p for some subsequence Z'. That is, the repetitions in S will be generated by visiting the same elements of T repeatedly. For example, if S = cancan, Z = can and p = 2, the sequence T^i[I] for i = 0, ..., N-1 will be [2, 4, 0, 2, 4, 0].

3 Why the transformed string compresses well

Algorithm C sorts the rotations of an input string S, and generates the string L consisting of the last character of each rotation. To see why this might lead to effective compression, consider the effect on a single letter in a common word in a block of English text. We will use the example of the letter t in the word the, and assume an input string containing many instances of the.

When the list of rotations of the input is sorted, all the rotations starting with he will sort together; a large proportion of them are likely to end in t. One region of the string L will therefore contain a disproportionately large number of t characters, intermingled with other characters that can precede he in English, such as space, s, T, and S.

The same argument can be applied to all characters in all words, so any localized region of the string L is likely to contain a large number of a few distinct characters. The overall effect is that the probability that a given character ch will occur at a given point in L is very high if ch occurs near that point in L, and is low otherwise. This property is exactly the one needed for effective compression by a move-to-front coder [4], which encodes an instance of character ch by the count of distinct characters seen since the next previous occurrence of ch. When applied to the string L, the output of a move-to-front coder will be dominated by low numbers, which can be efficiently encoded with a Huffman or arithmetic coder.

Figure 1 shows an example of the algorithm at work. Each line is the first few characters and final character of a rotation of a version of this document. Note the grouping of similar characters in the column of final characters.

For completeness, we now give details of one possible way to encode the output of Algorithm C, and the corresponding inverse operation. A complete compression algorithm is created by combining these encoding and decoding operations with Algorithms C and D.
    final char (L)   sorted rotations
    a   n to decompress.  It achieves compression
    o   n to perform only comparisons to a depth
    o   n transformation} This section describes
    o   n transformation} We use the example and
    o   n treats the right-hand side as the most
    a   n tree for each 16 kbyte input block, enc
    a   n tree in the output stream, then encodes
    i   n turn, set $L[i]$ to be the
    i   n turn, set $R[i]$ to the
    o   n unusual data. Like the algorithm of Man
    a   n use a single set of probabilities table
    e   n using the positions of the suffixes in
    i   n value at a given point in the vector $R
    e   n we present modifications that improve t
    e   n when the block size is quite large. Ho
    i   n which codes that have not been seen in
    i   n with $ch$ appear in the {\em same order
    i   n with $ch$. In our exam
    o   n with Huffman or arithmetic coding. Bri
    o   n with figures given by Bell \cite{bell}.

Figure 1: Example of sorted rotations. Twenty consecutive rotations from the sorted list of rotations of a version of this paper are shown, together with the final character of each rotation.
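Algorithms M and W below make the move-to-front steps precise. As a concrete illustration, here is a minimal Python sketch of the coder and its inverse (a sketch only, using the alphabet of the running example; it is not the tuned coder measured in Section 6):

    def mtf_encode(s, alphabet):
        # Code each character by the number of characters preceding it
        # in the list Y, then move it to the front of Y (step M1).
        y = list(alphabet)
        codes = []
        for ch in s:
            r = y.index(ch)
            codes.append(r)
            y.pop(r)
            y.insert(0, ch)
        return codes

    def mtf_decode(codes, alphabet):
        # Inverse: pick the character at each coded position, then move
        # it to the front, mirroring the encoder's list updates (step W2).
        y = list(alphabet)
        out = []
        for r in codes:
            ch = y.pop(r)
            y.insert(0, ch)
            out.append(ch)
        return ''.join(out)

    # Running example: L = caraab over Y = ['a', 'b', 'c', 'r'].
    assert mtf_encode('caraab', 'abcr') == [2, 1, 3, 1, 0, 3]
    assert mtf_decode([2, 1, 3, 1, 0, 3], 'abcr') == 'caraab'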
Algorithm M: Move-to-front coding

This algorithm encodes the output (L, I) of Algorithm C, where L is a string of length N and I is an index. It encodes L using a move-to-front algorithm and a Huffman or arithmetic coder. The running example of Algorithm C is continued here.

M1. [move-to-front coding] This step encodes each of the characters in L by applying the move-to-front technique to the individual characters. We define a vector of integers R[0], ..., R[N-1], which are the codes for the characters L[0], ..., L[N-1]. Initialize a list Y of characters to contain each character in the alphabet X exactly once. For each i = 0, ..., N-1 in turn, set R[i] to the number of characters preceding character L[i] in the list Y, then move character L[i] to the front of Y.

Taking Y = ['a', 'b', 'c', 'r'] initially, and L = caraab, we compute the vector R: (2 1 3 1 0 3).

M2. [encode] Apply Huffman or arithmetic coding to the elements of R, treating each element as a separate token to be coded. Any coding technique can be applied as long as the decompressor can perform the inverse operation. Call the output of this coding process OUT. The output of the compression algorithm is the pair (OUT, I), where I is the value computed in step C1.

Algorithm W: Move-to-front decoding

This algorithm is the inverse of Algorithm M. It computes the pair (L, I) from the pair (OUT, I). We assume that the initial value of the list Y used in step M1 is available to the decompressor, and that the coding scheme used in step M2 has an inverse operation.

W1. [decode] Decode the coded stream OUT using the inverse of the coding scheme used in step M2. The result is a vector R of N integers. In our example, R is: (2 1 3 1 0 3).
W2. [inverse move-to-front coding] The goal of this step is to calculate a string L of N characters, given the move-to-front codes R[0], ..., R[N-1]. Initialize a list Y of characters to contain the characters of the alphabet X in the same order as in step M1. For each i = 0, ..., N-1 in turn, set L[i] to be the character at position R[i] in list Y (numbering from 0), then move that character to the front of Y. The resulting string L is the last column of matrix M of Algorithm C. The output of this algorithm is the pair (L, I), which is the input to Algorithm D. Taking Y = ['a', 'b', 'c', 'r'] initially (as in Algorithm M), we compute the string L = caraab.

4 An efficient implementation

Efficient implementations of move-to-front, Huffman, and arithmetic coding are well known. Here we concentrate on the unusual steps in Algorithms C and D.

4.1 Compression

The most important factor in compression speed is the time taken to sort the rotations of the input block. A simple application of quicksort requires little additional space (one word per character), and works fairly well on most input data. However, its worst-case performance is poor.

A faster way to implement Algorithm C is to reduce the problem of sorting the rotations to the problem of sorting the suffixes of a related string. To compress a string S, first form the string S', which is the concatenation of S with EOF, a new character that does not appear in S. Now apply Algorithm C to S'. Because EOF is different from all other characters in S', the suffixes of S' sort in the same order as the rotations of S'. Hence, to sort the rotations, it suffices to sort the suffixes of S'. This can be done in linear time and space by building a suffix tree, which then can be walked in lexicographical order to recover the sorted suffixes. We used McCreight's suffix tree construction algorithm [5] in an implementation of Algorithm C. Its performance is within 40% of the fastest technique we have found when operating on text files. Unfortunately, suffix tree algorithms need space for more than four machine words per input character.

Manber and Myers give a simple algorithm for sorting the suffixes of a string in O(N log N) time [6]. The algorithm they describe requires only two words per
input character. The algorithm works by sorting suffixes on their first i characters, then using the positions of the suffixes in the sorted array as the sort key for another sort on the first 2i characters. Unfortunately, the performance of their algorithm is significantly worse than the suffix tree approach.

The fastest scheme we have tried so far uses a variant of quicksort to generate the sorted list of suffixes. Its performance is significantly better than the suffix tree when operating on text, and it uses significantly less space. Unlike the suffix tree, however, its performance can degrade considerably when operating on some types of data. Like the algorithm of Manber and Myers, our algorithm uses only two words per input character.

Algorithm Q: Fast quicksorting on suffixes

This algorithm sorts the suffixes of the string S, which contains N characters S[0, ..., N-1]. The algorithm works by applying a modified version of quicksort to the suffixes of S. First we present a simple version of the algorithm. Then we present modifications that improve the speed and make bad performance less likely.

Q1. [form extended string] Let k be the number of characters that fit in a machine word. Form the string S' from S by appending k additional EOF characters to S, where EOF does not appear in S.

Q2. [form array of words] Initialize an array W of N words W[0, ..., N-1], such that W[i] contains the characters S'[i, ..., i+k-1] arranged so that integer comparisons on the words agree with lexicographic comparisons on the k-character strings. Packing characters into words has two benefits: it allows two prefixes to be compared k bytes at a time using aligned memory accesses, and it allows many slow cases to be eliminated (described in step Q7).

Q3. [form array of suffixes] In this step we initialize an array V of N integers. If an element of V contains j, it represents the suffix of S' whose first character is S'[j]. When the algorithm is complete, suffix V[i] will be the ith suffix in lexicographical order. Initialize an array V of integers so that for each i = 0, ..., N-1: V[i] = i.
Q4. [radix sort] Sort the elements of V, using the first two characters of each suffix as the sort key. This can be done efficiently using radix sort.

Q5. [iterate over each character in the alphabet] For each character ch in the alphabet, perform steps Q6, Q7. Once this iteration is complete, V represents the sorted suffixes of S, and the algorithm terminates.

Q6. [quicksort suffixes starting with ch] For each character ch' in the alphabet: Apply quicksort to the elements of V starting with ch followed by ch'. In the implementation of quicksort, compare the elements of V by comparing the suffixes they represent by indexing into the array W. At each recursion step, keep track of the number of characters that have compared equal in the group, so that they need not be compared again. All the suffixes starting with ch have now been sorted into their final positions in V.

Q7. [update sort keys] For each element V[i] corresponding to a suffix starting with ch (that is, for which S[V[i]] = ch), set W[V[i]] to a value with ch in its high-order bits (as before) and with i in its low-order bits (which step Q2 set to the k-1 characters following the initial ch). The new value of W[V[i]] sorts into the same position as the old value, but has the desirable property that it is distinct from all other values in W. This speeds up subsequent sorting operations, since comparisons with the new elements cannot compare equal.

This basic algorithm can be refined in a number of ways. An obvious improvement is to pick the character ch in step Q5 starting with the least common character in S, and proceeding to the most common. This allows the updates of step Q7 to have the greatest effect.

As described, the algorithm performs poorly when S contains many repeated substrings. We deal with this problem with two mechanisms. The first mechanism handles strings of a single repeated character by replacing step Q6 with the following steps.

Q6a. [quicksort suffixes starting ch, ch' where ch ≠ ch'] For each character ch' ≠ ch in the alphabet: Apply quicksort to the elements of V starting with ch followed by ch'. In the implementation of quicksort,
compare the elements of V by comparing the suffixes they represent by indexing into the array W. At each recursion step, keep track of the number of characters that have compared equal in the group, so that they need not be compared again.

Q6b. [sort low suffixes starting ch, ch] This step sorts the suffixes starting with runs of ch which would sort before an infinite string of ch characters. We observe that among such suffixes, long initial runs of character ch sort after shorter initial runs, and runs of equal length sort in the same order as the characters at the ends of the run. This ordering can be obtained with a simple loop over the suffixes, starting with the shortest. Let V[i] represent the lexicographically least suffix starting with ch. Let j be the lowest value such that suffix V[j] starts with ch, ch.

    while not (i = j) do
        if (V[i] > 0) and (S[V[i]-1] = ch) then
            V[j] := V[i]-1;
            j := j + 1
        end;
        i := i + 1
    end;

All the suffixes that start with ch, ch and that are lexicographically less than an infinite sequence of ch characters are now sorted.

Q6c. [sort high suffixes starting ch, ch] This step sorts the suffixes starting with runs of ch which would sort after an infinite string of ch characters. This step is very like step Q6b, but with the direction of the loop reversed. Let V[i] represent the lexicographically greatest suffix starting with ch. Let j be the highest value such that suffix V[j] starts with ch, ch.

    while not (i = j) do
        if (V[i] > 0) and (S[V[i]-1] = ch) then
            V[j] := V[i]-1;
            j := j - 1
        end;
        i := i - 1
    end;

All the suffixes that start with ch, ch and that are lexicographically greater than an infinite sequence of ch characters are now sorted.
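The ordering observations behind steps Q6b and Q6c can be checked directly; here ch = 'b', and the snippet is a check of the claim, not part of the implementation:

    # Among suffixes sorting below an infinite run of 'b' (runs ended by
    # 'a' < 'b'), longer initial runs sort after shorter ones ...
    assert sorted(['ba', 'bba', 'bbba']) == ['ba', 'bba', 'bbba']
    # ... while among suffixes sorting above it (runs ended by 'c' > 'b'),
    # the order of run lengths is reversed.
    assert sorted(['bc', 'bbc', 'bbbc']) == ['bbbc', 'bbc', 'bc']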
The second mechanism handles long strings of a repeated substring of more than one character. For such strings, we use a doubling technique similar to that used in Manber and Myers' scheme. We limit our quicksort implementation to perform comparisons only to a depth D. We record where suffixes have been left unsorted with a bit in the corresponding elements of V. Once steps Q1 to Q7 are complete, the elements of the array W indicate sorted positions up to D × k characters, since each comparison compares k characters. If unsorted portions of V remain, we simply sort them again, limiting the comparison depth to D as before. This time, each comparison compares D × k characters, to yield a list of suffixes sorted on their first D^2 × k characters. We repeat this process until the entire string is sorted.

The combination of radix sort, a careful implementation of quicksort, and the mechanisms for dealing with repeated strings produce a sorting algorithm that is extremely fast on most inputs, and is quite unlikely to perform very badly on real input data.

We are currently investigating further variations of quicksort. The following approach seems promising. By applying the technique of Q6a-Q6c to all overlapping repeated strings, and by caching previously computed sort orders in a hash table, we can produce an algorithm that sorts the suffixes of a string in approximately the same time as the modified algorithm Q, but using only 6 bytes of space per input character (4 bytes for a pointer, 1 byte for the input character itself, and 1 byte amortized space in the hash table). This approach has poor performance in the worst case, but bad cases seem to be rare in practice.

4.2 Decompression

Steps D1 and D2 can be accomplished efficiently with only two passes over the data, and one pass over the alphabet, as shown in the pseudo-code below. In the first pass, we construct two arrays: C[alphabet] and P[0, ..., N-1]. C[ch] is the total number of instances in L of characters preceding character ch in the alphabet. P[i] is the number of instances of character L[i] in the prefix L[0, ..., i-1] of L. In practice, this first pass could be combined with the move-to-front decoding step. Given the arrays L, C, P, the array T defined in step D2 is given by:

    for each i = 0, ..., N-1: T[i] = P[i] + C[L[i]]

In a second pass over the data, we use the starting position I to complete Algorithm D to generate S. Initially, all elements of C are zero, and the last column of matrix M is the vector L[0, ..., N-1].
    for i := 0 to N-1 do
        P[i] := C[L[i]];
        C[L[i]] := C[L[i]] + 1
    end;

Now C[ch] is the number of instances in L of character ch. The value P[i] is the number of instances of character L[i] in the prefix L[0, ..., i-1] of L.

    sum := 0;
    for ch := FIRST(alphabet) to LAST(alphabet) do
        sum := sum + C[ch];
        C[ch] := sum - C[ch];
    end;

Now C[ch] is the total number of instances in L of characters preceding ch in the alphabet.

    i := I;
    for j := N-1 downto 0 do
        S[j] := L[i];
        i := P[i] + C[L[i]]
    end

The decompressed result is now S[0, ..., N-1].

5 Algorithm variants

In the example given in step C1, we treated the left-hand side of each rotation as the most significant when sorting. In fact, our implementation treats the right-hand side as the most significant, so that the decompressor will generate its output S from left to right using the code of Section 4.2. This choice has almost no effect on the compression achieved.

The algorithm can be modified to use a different coding scheme instead of move-to-front in step M1. Compression can improve slightly if move-to-front is replaced by a scheme in which codes that have not been seen in a very long time are moved only part-way to the front. We have not exploited this in our implementations.

Although simple Huffman and arithmetic coders do well in step M2, a more complex coding scheme can do slightly better. This is because the probability of a certain value at a given point in the vector R depends to a certain extent on the immediately preceding value. In practice, the most important effect is that zeroes tend to occur in runs in R. We can take advantage of this effect by representing each run of zeroes by a code indicating the length of the run. A second set of Huffman codes can be used for values immediately following a run of zeroes, since the next value cannot be another run of zeroes.
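A minimal Python sketch of the run-of-zeroes idea follows (the RUN and LIT token names are illustrative assumptions of ours, not the exact codes of our implementation): each maximal run of zeroes in R becomes a single token carrying the run length, and the resulting token stream is then Huffman or arithmetic coded.

    def tokenize_zero_runs(r):
        # Replace each maximal run of zeroes in the move-to-front output R
        # with one ('RUN', length) token; other values pass through as
        # ('LIT', value) tokens for the back-end entropy coder.
        out, i = [], 0
        while i < len(r):
            if r[i] == 0:
                j = i
                while j < len(r) and r[j] == 0:
                    j += 1
                out.append(('RUN', j - i))
                i = j
            else:
                out.append(('LIT', r[i]))
                i += 1
        return out

    assert tokenize_zero_runs([2, 0, 0, 0, 1, 0, 3]) == [
        ('LIT', 2), ('RUN', 3), ('LIT', 1), ('RUN', 1), ('LIT', 3)]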
Table 1: Results of compressing fourteen files of the Calgary Compression Corpus (bib, book1, book2, geo, news, obj1, obj2, paper1, paper2, pic, progc, progl, progp, and trans, with a total). For each file the table gives the size in bytes, the compression and decompression CPU times in seconds, the compressed size in bytes, and the output bits per input character.

Table 2: The effect of varying block size (N) on compression of book1 from the Calgary Compression Corpus and of the entire Hector corpus, in bits/character, for block sizes from 1 kbyte upwards.
6 Performance of implementation

In Table 1 we give the results of compressing the fourteen commonly used files of the Calgary Compression Corpus [7] with our algorithm. Comparison of these figures with those given by Bell, Witten and Cleary [3] indicates that our algorithm does well on non-text inputs as well as text inputs.

Our implementation of Algorithm C uses the techniques described in Section 4.1, and a simple move-to-front coder. In each case, the block size N is the length of the file being compressed. We compress the output of the move-to-front coder with a modified Huffman coder that uses the technique described in Section 5. Our coder calculates new Huffman trees for each 16 kbyte input block, rather than computing one tree for its whole input.

In Table 1, compression effectiveness is expressed as output bits per input character. The CPU time measurements were made on a DECstation 5000/200, which has a MIPS R3000 processor clocked at 25 MHz with a 64 kbyte cache. CPU time is given rather than elapsed time so the time spent performing I/O is excluded.

In Table 2, we show the variation of compression effectiveness with input block size for two inputs. The first input is the file book1 from the Calgary Compression Corpus. The second is the entire Hector corpus [8], which consists of a little over 100 Mbytes of modern English text. The table shows that compression improves with increasing block size, even when the block size is quite large. However, increasing the block size above a few tens of millions of characters has only a small effect.

If the block size were increased indefinitely, we expect that the algorithm would not approach the theoretical optimum compression because our simple byte-wise move-to-front coder introduces some loss. A better coding scheme might achieve optimum compression asymptotically, but this is not likely to be of great practical importance.

Comparison with other compression programmes

Table 3 compares our implementation of Algorithm C with three other compression programmes, chosen for their wide availability. The same fourteen files were compressed individually with each algorithm, and the results totalled. The bits per character values are the means of the values for the individual files. This metric was chosen to allow easy comparison with figures given by Bell [3].
Table 3: Comparison with other compression programmes (compress, gzip, Alg-C/D, and comp-2). For each programme the table gives the total compression and decompression CPU time in seconds, the total compressed size in bytes, and the mean bits/char.

compress is a version of the well-known LZW-based tool [9, 10]. gzip is a version of Gailly's LZ77-based tool [1, 11]. Alg-C/D is Algorithms C and D, with our back-end Huffman coder. comp-2 is Nelson's comp-2 coder, limited to a fourth-order model [12].

7 Conclusions

We have described a compression technique that works by applying a reversible transformation to a block of text to make redundancy in the input more accessible to simple coding schemes. Our algorithm is general-purpose, in that it does well on both text and non-text inputs. The transformation uses sorting to group characters together based on their contexts; this technique makes use of the context on only one side of each character.

To achieve good compression, input blocks of several thousand characters are needed. The effectiveness of the algorithm continues to improve with increasing block size at least up to blocks of several million characters.

Our algorithm achieves compression comparable with good statistical modellers, yet is closer in speed to coders based on the algorithms of Lempel and Ziv. Like Lempel and Ziv's algorithms, our algorithm decompresses faster than it compresses.

Acknowledgements

We would like to thank Andrei Broder, Cynthia Hibbard, Greg Nelson, and Jim Saxe for their suggestions on the algorithms and the paper.
References

[1] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, Vol. IT-23, No. 3, May 1977.

[2] J. Ziv and A. Lempel. Compression of individual sequences via variable rate coding. IEEE Transactions on Information Theory, Vol. IT-24, No. 5, September 1978.

[3] T. Bell, I.H. Witten, and J.G. Cleary. Modeling for text compression. ACM Computing Surveys, Vol. 21, No. 4, December 1989.

[4] J.L. Bentley, D.D. Sleator, R.E. Tarjan, and V.K. Wei. A locally adaptive data compression algorithm. Communications of the ACM, Vol. 29, No. 4, April 1986.

[5] E.M. McCreight. A space economical suffix tree construction algorithm. Journal of the ACM, Vol. 23, No. 2, April 1976.

[6] U. Manber and E.W. Myers. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, Vol. 22, No. 5, October 1993.

[7] I.H. Witten and T. Bell. The Calgary/Canterbury text compression corpus. Anonymous ftp from ftp.cpsc.ucalgary.ca: /pub/text.compression.corpus/text.compression.corpus.tar.z

[8] L. Glassman, D. Grinberg, C. Hibbard, J. Meehan, L. Guarino Reid, and M-C. van Leunen. Hector: Connecting Words with Definitions. Proceedings of the 8th Annual Conference of the UW Centre for the New Oxford English Dictionary and Text Research, Waterloo, Canada, October. Also available as Research Report 92a, Digital Equipment Corporation Systems Research Center, 130 Lytton Ave, Palo Alto, CA.

[9] T.A. Welch. A technique for high performance data compression. IEEE Computer, Vol. 17, No. 6, June 1984.

[10] Peter Jannesen et al. Compress. Posted to the Internet newsgroup comp.sources.reviewed, 28th August. Anonymous ftp from gatekeeper.dec.com: /pub/misc/ncompress
[11] J. Gailly et al. Gzip. Anonymous ftp from prep.ai.mit.edu: /pub/gnu/gzip tar.gz

[12] Mark Nelson. Arithmetic coding and statistical modeling. Dr. Dobb's Journal, February. Anonymous ftp from wuarchive.wustl.edu: /mirrors/msdos/ddjmag/ddj9102.zip