A Catalogue of the Steiner Triple Systems of Order 19



Similar documents
Information, Entropy, and Coding

Fast Arithmetic Coding (FastAC) Implementations

Arithmetic Coding: Introduction

Indexing and Compression of Text

Gambling and Data Compression

Streaming Lossless Data Compression Algorithm (SLDC)

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

LZ77. Example 2.10: Let T = badadadabaab and assume d max and l max are large. phrase b a d adadab aa b

Caml Virtual Machine File & data formats Document version: 1.4

This section describes how LabVIEW stores data in memory for controls, indicators, wires, and other objects.

Multi-layer Structure of Data Center Based on Steiner Triple System

Binary search tree with SIMD bandwidth optimization using SSE

- Easy to insert & delete in O(1) time - Don t need to estimate total memory needed. - Hard to search in less than O(n) time

Raima Database Manager Version 14.0 In-memory Database Engine

Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems

International Journal of Advanced Research in Computer Science and Software Engineering

COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH. 1. Introduction

Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.

Wan Accelerators: Optimizing Network Traffic with Compression. Bartosz Agas, Marvin Germar & Christopher Tran

Analysis of Algorithms I: Optimal Binary Search Trees

JUST-IN-TIME SCHEDULING WITH PERIODIC TIME SLOTS. Received December May 12, 2003; revised February 5, 2004

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Efficient Prevention of Credit Card Leakage from Enterprise Networks

Binary storage of graphs and related data

k, then n = p2α 1 1 pα k

TECHNICAL UNIVERSITY OF CRETE DATA STRUCTURES FILE STRUCTURES

On the independence number of graphs with maximum degree 3

Multimedia Systems WS 2010/2011

APP INVENTOR. Test Review

Parallel Compression and Decompression of DNA Sequence Reads in FASTQ Format

Testing LTL Formula Translation into Büchi Automata

To convert an arbitrary power of 2 into its English equivalent, remember the rules of exponential arithmetic:

Offline sorting buffers on Line

Hadoop and Map-reduce computing

Local periods and binary partial words: An algorithm

A REMARK ON ALMOST MOORE DIGRAPHS OF DEGREE THREE. 1. Introduction and Preliminaries

The mathematics of RAID-6

MapReduce With Columnar Storage

Storing Measurement Data

Tertiary Storage and Data Mining queries

Lossless Data Compression Standard Applications and the MapReduce Web Computing Framework

A binary search tree or BST is a binary tree that is either empty or in which the data element of each node has a key, and:

Sources: On the Web: Slides will be available on:

Encoding Text with a Small Alphabet

Operating Systems CSE 410, Spring File Management. Stephen Wagner Michigan State University

Frans J.C.T. de Ruiter, Norman L. Biggs Applications of integer programming methods to cages

Entropy and Mutual Information

Lempel-Ziv Coding Adaptive Dictionary Compression Algorithm

Compression techniques

Binary Search Trees. A Generic Tree. Binary Trees. Nodes in a binary search tree ( B-S-T) are of the form. P parent. Key. Satellite data L R

Analysis of Binary Search algorithm and Selection Sort algorithm

1 Approximating Set Cover

Introduction to computer science

Data storage Tree indexes

Numbering Systems. InThisAppendix...

Mass Storage Structure

Probability Interval Partitioning Entropy Codes

Chapter Objectives. Chapter 9. Sequential Search. Search Algorithms. Search Algorithms. Binary Search

Binary Trees and Huffman Encoding Binary Search Trees

arxiv: v1 [math.pr] 5 Dec 2011

Curriculum Map. Discipline: Computer Science Course: C++

CS 2112 Spring Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

Storage Management for Files of Dynamic Records

ALLIED PAPER : DISCRETE MATHEMATICS (for B.Sc. Computer Technology & B.Sc. Multimedia and Web Technology)

THE SECURITY AND PRIVACY ISSUES OF RFID SYSTEM

Notes on Factoring. MA 206 Kurt Bryan

Bounded Cost Algorithms for Multivalued Consensus Using Binary Consensus Instances

Video-Conferencing System

Lempel-Ziv Factorization: LZ77 without Window

Linear Codes. Chapter Basics

Constructing Digital Signatures from a One Way Function

13 File Output and Input

Regular Languages and Finite Automata

Compressing Forwarding Tables for Datacenter Scalability

Big Data & Scripting Part II Streaming Algorithms

Compression algorithm for Bayesian network modeling of binary systems

Image Compression through DCT and Huffman Coding Technique

Formal Languages and Automata Theory - Regular Expressions and Finite Automata -

How To Write A Hexadecimal Program

FCE: A Fast Content Expression for Server-based Computing

Physical Data Organization

Output: struct treenode{ int data; struct treenode *left, *right; } struct treenode *tree_ptr;

CHAPTER 7 GENERAL PROOF SYSTEMS

Introduction to Medical Image Compression Using Wavelet Transform

JOURNAL OF OBJECT TECHNOLOGY

On an algorithm for classification of binary self-dual codes with minimum distance four

Random vs. Structure-Based Testing of Answer-Set Programs: An Experimental Comparison

Quantum Computing Lecture 7. Quantum Factoring. Anuj Dawar

File Management. Chapter 12

zdelta: An Efficient Delta Compression Tool

SoMA. Automated testing system of camera algorithms. Sofica Ltd

Large induced subgraphs with all degrees odd

Data Reduction: Deduplication and Compression. Danny Harnik IBM Haifa Research Labs

Didacticiel Études de cas. Association Rules mining with Tanagra, R (arules package), Orange, RapidMiner, Knime and Weka.

Arrays. number: Motivation. Prof. Stewart Weiss. Software Design Lecture Notes Arrays

Logging. Working with the POCO logging framework.

Chapter 11 I/O Management and Disk Scheduling

Rethinking SIMD Vectorization for In-Memory Databases

Introduction to Scheduling Theory

ASCII Encoding. The char Type. Manipulating Characters. Manipulating Characters

Transcription:

A Catalogue of the Steiner Triple Systems of Order 19 Petteri Kaski 1, Patric R. J. Östergård 2, Olli Pottonen 2, and Lasse Kiviluoto 3 1 Helsinki Institute for Information Technology HIIT University of Helsinki, Department of Computer Science P.O. Box 68, 00014 University of Helsinki, Finland 2 Department of Communications and Networking Helsinki University of Technology TKK P.O. Box 3000, 02015 TKK, Finland 3 Celtius Ltd Pieni Robertinkatu 11, 00130 Helsinki, Finland Abstract A method for compressing Steiner triple systems is presented. This method has been used to compress the 11,084,874,829 Steiner triple systems of order 19 into approximately 39 gigabytes of memory. The compressed data can be obtained by contacting the authors. 1 Introduction After the announcement of the completed classification of the Steiner triple systems of order 19 STS(19) for short which was later published in [2], the authors got several requests for a complete collection of the systems in electronic form. Such requests were politely answered by saying that there are in fact more than 11 billion, or precisely N(19) = 11,084,874,829 such objects, so these had not been saved during the search. Indeed, saving the generated systems as 19 57 binary incidence matrices would take about 1.5

terabytes of storage not an altogether impossible amount of data for modern mass storage devices, but certainly cumbersome to handle. It is an instructive exercise to think about how much this data can be practically compressed not only in the case of the STS(19), but also for other collections of combinatorial objects. Theoretically, assuming a collection of Q objects, on average at least log 2 Q bits are required to uniquely represent (identify) an object in the collection. Concatenating the compressed representations of all the objects, at least Qlog 2 Q bits, or approximately 46 gigabytes for Q = N(19), are required. If one relaxes the requirement that each object be represented as a unique string of bits, other possibilities for compressing the objects become apparent. At one extreme, the phrase list all nonisomorphic STS(19) arguably compresses our objects with the human decompressor in mind. In a more theoretically disciplined setting (properly captured in the theory of Kolmogorov complexity [4]), one can ask for the shortest computer program that outputs a complete list of isomorphism class representatives. Considering the total size of the computer programs needed to carry out the original classification about 200 kilobytes one can obviously say that the objects can be compressed into this size in a modern computer, but extracting the objects is then rather slow. In the present study we are interested in obtaining a practical trade-off between requirements in storage space and in decompression time. Here we anticipate that a user is interested in either (a) systematically searching through the objects, or (b) sampling the objects uniformly at random, for example, to test whether they have certain properties. The compression scheme to be presented makes use of the fact that there is a strong degree of similarity between consecutive objects generated by the classification program. The Steiner triple systems of order 19 are compressed into about 39 gigabytes, or little more than 28 bits per system. Decompression of the entire collection of objects takes approximately 15 hours on a Linux PC with a 2-GHz CPU (5µs per isomorphism class), which is about 0.5% of the time required by a classification from scratch with our current implementation. 2 Compression of Steiner Triple Systems 2.1 Preliminaries A Steiner triple system of order v is a pair (P,B), where P is a v-element set of points and B is a set of 3-element subsets of P, called blocks, such that every 2-subset of P occurs in exactly one block. Each point of P necessarily occurs in exactly r = (v 1)/2 blocks, and the number of blocks is b = v(v 1)/6. For convenience in what follows, we assume that the point set is P = {1,2,...,v} and that pairs and triples over P are lexicographically ordered. More precisely, for

pairs {x,y} and {u,v} with x < y and u < v, we define {x,y} < {u,v} if and only if either x < u or x = u, y < v. Similarly, for triples {x,y,z} and {u,v,w} with x < y < z and u < v < w, we define {x,y,z} < {u,v,w} if and only if x < u or x = u, y < v or x = u, y = v, z < w. 2.2 The Compression Algorithm Consider the blocks of an STS in increasing lexicographical order, and observe that the two smallest points of each block form the lexicographically smallest pair that does not occur in any of the preceding blocks. Consequently, the list of the largest elements of the blocks then uniquely determines the STS. This observation leads to effective compression, and we compress a list of STSs by applying the following algorithm to each system: 1. Sort the blocks in ascending lexicographical order B 1 < B 2 < < B b. 2. If the system under consideration is the first one, let i = 0. Otherwise, compute the largest i such that the first i blocks are equal to the first i blocks of the previous STS and store the result. 3. For j = i + 1,i + 2,...,b, consider B j = {x,y,z} with x < y < z. Compute the set P of all points p for which neither {x, p} nor {y, p} occurs in any of the blocks B 1,B 2,...,B j 1. Obviously P is nonempty since z P. Let n = P. With P = {p 1, p 2,..., p n }, p 1 < p 2 < < p n, find the index k for which p k = z and store it. Decompression is achieved with the straightforward inverse operation. Before proceeding further we note that the block B i+1 is never equal to the corresponding block of the previous STS. This fact can be used to further compress the data slightly; however, here we ignore this observation to enable a more streamlined implementation. Using the aforementioned scheme, an STS is transformed into a sequence of 1 + b i integers consisting of the value i and the indices k. The related values of n are obtained by the algorithm. Rough bounds for the values of i and k are 0 i b and 1 k n v 2. We use Huffman coding (see [1] or any other textbook treating source coding) for storing the integer sequences obtained from the STSs. Since different values of n lead to different distributions of the values of k, we have designed one code for each possible value of n. For n = 1 the only codeword is the empty string. A separate code is used to encode the values of i. 2.3 Implementation Details To enable storage on file systems with restrictions on individual file size, and to make a distribution on media with limited storage capacity possible, the com-

pressed data is partitioned into several files. To simplify random access, we further divide the STSs into sets of equal size and compress each set independently (several such sets may still be stored in the same file). We used sets of 500 STSs, but other sizes are also supported. The last set may obviously have fewer STSs. The details of the file format for compressed data are as follows. A file consists of three parts. The first part is simply the byte offset at which the third part begins. The second part consists of the sets of compressed STSs. The third part contains, in this order, a magic number 0xfe86dcdd included as a validity check, the number of STSs contained in the file (which is used for detecting the last STS), the number of STSs in the sets that are independently compressed (currently 500, as mentioned above), and the byte offsets for each set, starting with the first one. The data in the first and third part is represented as 4-byte unsigned integers with bytes written starting with the most significant byte, that is, in big-endian order. (Due to the 4-byte size for byte offsets, the files cannot be arbitrarily large.) In Huffman coding and decoding, a byte is considered as a string of bits starting with the least significant bit. 2.4 Decompression Software The compressed data is accompanied by decompression software. The software contains a simple command-line utility and a library which allows other programs to access the compressed data. The command-line utility, called stsc, can be used in the following ways: stsc -u <inputfile> <outputfile> stsc -c <inputfile> <outputfile> stsc -s <inputfile> <index> Decompression Compression Decompression of single STS Here the index specifies which of the STSs in the input file is to be decompressed; the indices start at 1. To use standard input or output streams instead of ordinary files, specify - as input or output file, respectively. In input and output, each STS is represented as a list of blocks without delimiters. The points are denoted by characters a,b,...,s. For example, {1,2,4}, {2,3,5}, {3,4,6}, {4,5,7}, {1,5,6}, {2,6,7}, {1,3,7},... would be represented as abdbcecdfdegaefbfgacg... Each STS is followed by a newline. The described algorithm works for any order v, but the software is tailored for order 19. Nontrivial changes like defining Huffman codes are needed for other orders. To use the decompression library, the header file compression.h must be included and the library libstsc.a must be linked. This header file defines the preprocessor constants STS_V and STS_B as the number of points and blocks of

an STS, respectively. The following functions are used to decompress data. If an error is detected, an error message is printed and the program is terminated with abort(). file_r *init_decompress(file *) Initializes and returns a new file reader instance. An error is reported in case of a memory allocation failure, an I/O error, or an invalid file. void free_decompress(file_r *) Deallocate the file reader. const int *decompress(file_r *) Decompresses the next STS and returns a pointer to it. Reports an error if the end of the file has been reached, an I/O error occurs, or the file contains invalid data. The pointer remains valid until the next call to decompress() or free_decompress(). void open_set(file_r *, int) Causes decompression to continue at the beginning of the specified set of STSs. Indexing of the sets starts at 0. An error is reported if the index is negative, the index is too large, or an I/O error occurs. int get_sts_count(const file_r *) Returns the number of STSs the file contains. int get_set_size(const file_r *) Returns the size of the sets to which the STSs are divided. int get_offset(const file_r *) Returns the index of the next STS to be decompressed. Indexing starts at 0. As an example, we include the following function, which implements the essential functionality of stsc -s. That is, the function takes a file and an index, and gets the STS in the file at the given index. The indices start at 0. #include "compression.h" const void get_sts(file *f, int index, int *sts) { file_r *r = init_decompress(f); if(index < 0 index >= get_sts_count(r)) { fprintf(stderr, "get_sts: index out of bounds\n"); abort(); } int set_size = get_set_size(r);

} open_set(r, index/set_size); index %= set_size; const int *s; for(i=0 ; i<=index ; ++i) s = decompress(r); memcpy(sts, s, sizeof(int[3*sts_b])); free_decompress(r); The source code of stsc can be considered as a more complete version of this example. The decompression library is also accompanied by sample.c, which generates a sample of the STS(19) uniformly at random. 3 Conclusions A practical compression scheme for Steiner triple systems has been presented and used for compressing the complete set of 11,084,874,829 nonisomorphic systems of order 19 into approximately 39 gigabytes. There is very little, if any, previous work concerning compression of combinatorial objects. One possible idea for further research would be to utilise an existing listing/search algorithm for the objects to enable further compression. In particular, to describe the objects, it suffices to compress the paths in the search tree of the algorithm that lead to the objects one can expect the descriptions of such paths to compress better than the descriptions of the objects themselves. Work is in progress [3] to collect results on various properties of Steiner triple systems of order 19. The authors of [3] are glad to invite colleagues to join this project by contributing with an investigation of a property deemed interesting. The compressed STS(19) data and the decompression software can be obtained from the authors to this end, please send an email to petteri.kaski@cs.helsinki.fi, patric.ostergard@tkk.fi, or olli.pottonen@tkk.fi, and ask for instructions. Acknowledgments The third author was supported in part by the Graduate School in Electronics, Telecommunication and Automation and a grant from the Foundation of Technology (Tekniikan edistämissäätiö). The research was supported in part by the Academy of Finland, Grants 107493, 110196, and 117499. The Department of Computer Science at the University of Helsinki is gratefully acknowledged for

providing computing resources. The authors thank Esa Seuranen for fruitful discussions. References [1] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed., Wiley, New York, 2006. [2] P. Kaski and P. R. J. Östergård, The Steiner triple system of order 19, Math. Comp. 73 (2004), 2075 2092. [3] P. Kaski, P. R. J. Östergård, and O. Pottonen, Properties of the Steiner triple systems of order 19, in preparation. [4] M. Li and P. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, 2nd ed., Springer, New York, 1997.