Efficient representation of integer sets



Similar documents
A Comparison of Dictionary Implementations

Binary Search Trees. Data in each node. Larger than the data in its left child Smaller than the data in its right child

Persistent Binary Search Trees

Symbol Tables. Introduction

KITES TECHNOLOGY COURSE MODULE (C, C++, DS)

1 Abstract Data Types Information Hiding

Sources: On the Web: Slides will be available on:

Binary Heap Algorithms

PES Institute of Technology-BSC QUESTION BANK

From Last Time: Remove (Delete) Operation

10CS35: Data Structures Using C

A binary search tree or BST is a binary tree that is either empty or in which the data element of each node has a key, and:

1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++

TREE BASIC TERMINOLOGIES

C Programming. for Embedded Microcontrollers. Warwick A. Smith. Postbus 11. Elektor International Media BV. 6114ZG Susteren The Netherlands

Algorithms and Data Structures

The C Programming Language course syllabus associate level

C++ Programming Language

Converting a Number from Decimal to Binary

Outline BST Operations Worst case Average case Balancing AVL Red-black B-trees. Binary Search Trees. Lecturer: Georgy Gimel farb

Logic in Computer Science: Logic Gates

Section IV.1: Recursive Algorithms and Recursion Trees

S. Muthusundari. Research Scholar, Dept of CSE, Sathyabama University Chennai, India Dr. R. M.

Ordered Lists and Binary Trees

Chapter 8: Bags and Sets

Analysis of a Search Algorithm

1. The memory address of the first element of an array is called A. floor address B. foundation addressc. first address D.

Testing LTL Formula Translation into Büchi Automata

A Static Analyzer for Large Safety-Critical Software. Considered Programs and Semantics. Automatic Program Verification by Abstract Interpretation

Chapter One Introduction to Programming

Lecture Notes on Binary Search Trees

Bachelor of Games and Virtual Worlds (Programming) Subject and Course Summaries

Data Structures and Algorithms

Embedded Programming in C/C++: Lesson-1: Programming Elements and Programming in C

Embedded Systems. Review of ANSI C Topics. A Review of ANSI C and Considerations for Embedded C Programming. Basic features of C

DNA Data and Program Representation. Alexandre David

On the number of lines of theorems in the formal system MIU

Krishna Institute of Engineering & Technology, Ghaziabad Department of Computer Application MCA-213 : DATA STRUCTURES USING C

Structure for String Keys

Compiling CAO: from Cryptographic Specifications to C Implementations

Computer Science 281 Binary and Hexadecimal Review

Data Structure and Algorithm I Midterm Examination 120 points Time: 9:10am-12:10pm (180 minutes), Friday, November 12, 2010

Data Structure with C

Data Structures Fibonacci Heaps, Amortized Analysis

Lecture Notes on Binary Search Trees

Data Structures and Algorithms

Keywords are identifiers having predefined meanings in C programming language. The list of keywords used in standard C are : unsigned void

Binary Search Trees. A Generic Tree. Binary Trees. Nodes in a binary search tree ( B-S-T) are of the form. P parent. Key. Satellite data L R

: provid.ir

The Set Data Model CHAPTER What This Chapter Is About

Full and Complete Binary Trees

AQA GCSE in Computer Science Computer Science Microsoft IT Academy Mapping

In-Memory Database: Query Optimisation. S S Kausik ( ) Aamod Kore ( ) Mehul Goyal ( ) Nisheeth Lahoti ( )

1. Relational database accesses data in a sequential form. (Figures 7.1, 7.2)

Regular Expressions and Automata using Haskell

Top 10 Bug-Killing Coding Standard Rules

Physical Data Organization

Introduction Advantages and Disadvantages Algorithm TIME COMPLEXITY. Splay Tree. Cheruku Ravi Teja. November 14, 2011

Chapter 14 The Binary Search Tree

5. A full binary tree with n leaves contains [A] n nodes. [B] log n 2 nodes. [C] 2n 1 nodes. [D] n 2 nodes.

TECHNICAL UNIVERSITY OF CRETE DATA STRUCTURES FILE STRUCTURES

To convert an arbitrary power of 2 into its English equivalent, remember the rules of exponential arithmetic:

DYNAMIC DOMAIN CLASSIFICATION FOR FRACTAL IMAGE COMPRESSION

Data Structures and Algorithms!

The programming language C. sws1 1

I PUC - Computer Science. Practical s Syllabus. Contents

Efficiency of algorithms. Algorithms. Efficiency of algorithms. Binary search and linear search. Best, worst and average case.

NUMBER SYSTEMS APPENDIX D. You will learn about the following in this appendix:

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.

How To Write Portable Programs In C

7.1 Our Current Model

2) Write in detail the issues in the design of code generator.

CSE 326: Data Structures B-Trees and B+ Trees

An Evaluation of Self-adjusting Binary Search Tree Techniques

Efficient Data Structures for Decision Diagrams

root node level: internal node edge leaf node Data Structures & Algorithms McQuain

Semantic Analysis: Types and Type Checking

Name: Class: Date: 9. The compiler ignores all comments they are there strictly for the convenience of anyone reading the program.

Data Structures and Algorithms!

Instruction Set Architecture (ISA)

Previous Lectures. B-Trees. External storage. Two types of memory. B-trees. Main principles

2) What is the structure of an organization? Explain how IT support at different organizational levels.

Lecture 11 Doubly Linked Lists & Array of Linked Lists. Doubly Linked Lists

Bachelors of Computer Application Programming Principle & Algorithm (BCA-S102T)

A System for Program Visualization in the Classroom

Structural Design Patterns Used in Data Structures Implementation

Cpt S 223. School of EECS, WSU

Properties of Stabilizing Computations

Chapter 4 Register Transfer and Microoperations. Section 4.1 Register Transfer Language

PART-A Questions. 2. How does an enumerated statement differ from a typedef statement?

Data Structures. Level 6 C Module Descriptor

El Dorado Union High School District Educational Services

Fundamental Algorithms

UIL Computer Science for Dummies by Jake Warren and works from Mr. Fleming

DATA STRUCTURES USING C

Automata-based Verification - I

B-Trees. Algorithms and data structures for external memory as opposed to the main memory B-Trees. B -trees

High-performance local search for planning maintenance of EDF nuclear park

Transcription:

Efficient representation of integer sets Marco Almeida Rogério Reis Technical Report Series: DCC-2006-06 Version 1.0 Departamento de Ciência de Computadores & Laboratório de Inteligência Artificial e Ciência de Computadores Faculdade de Ciências da Universidade do Porto Rua do Campo Alegre, 1021/1055, 4169-007 PORTO, PORTUGAL Tel: 220 402 900 Fax: 220 402 950 http://www.dcc.fc.up.pt/pubs/

Efficient representation of integer sets Marco Almeida Rogério Reis {mfa,rvr}@ncc.up.pt DCC-FC & LIACC, Universidade do Porto R. do Campo Alegre 1021/1055, 4169-007 Porto, Portugal Abstract We present a solution that uses bitmaps and bitwise operators to represent nonnegative integer sets and implement some common set theoretical operations. We also show how it is possible to extend this representation to set partitions. 1 Introduction Finite sets are a common and frequently used data structure. Several algorithms make heavy use of these objects, and, in some cases, they are at the very core of the algorithm. This makes it necessary to have a simple and efficient way to represent and manipulate these objects. We present a solution that uses bitmaps and bitwise operators to represent nonnegative integer sets and implement some common set theoretical operations. We also show how it is possible to extend this representation to set partitions. This representation was used in an implementation of Hopcroft s automata minimization algorithm as part of the FAdo [fad] project. Some experimental results were presented in the DCFS 2006 [AMR06]. This report is organized as follows. In the next Section we describe some computational advantages of our bitmap representation. In Section 3 we show how to use a bitmap to represent a set and implement some basic set theoretical operations. In Section 4 we explain how to complete the bitmap representation so that we can also deal with set partitions. In Section 5 we present the implementation details of a set library based on this representation and in Section 2 we describe some computational advantages. The complete API is given in Appendix A. 2 Computational advantages Boolean algebras can represent basic operations of both set theory and propositional logic. Using bit sequences and propositional logic connectives, we can represent finite non-negative integer sets and implement some basic set theoretical operations such as union, intersection and difference. In computational terms, there are two main advantages that come directly out of this representation. First, the amount of memory needed by this data structure is very small. Second, any general purpose modern CPU has very efficient implementations of bitwise operations. These instructions require very few clock cycles to complete, even when compared to the most elementary arithmetic instructions. Bitmaps present significant advantages over equivalent simple set data structures such as arrays or linked lists. These are compact data structures which allow to store n elements 2

in n/w memory words. Small bitmaps (those with a size that does not exceed the word size) can be stored and manipulated in registers during the execution of several instructions without the need to access memory. This allows to exploit bit-level parallelism, limit the memory access and maximize the use of the data cache, making it possible to obtain significant speed gains. 3 Sets with bitmaps Using a bitmap with n bits we can represent a set of non-negative integers in the range 0,...,n 1 in the following way. For each bit in the bitmap, if the bit s value is 1, we assume that the integer corresponding to the bit s position in the bitmap is a member of the set. The rest of the bits represent integers that do not belong to the set. Let us consider the sets A = {2, 3, 5} and B = {0, 2, 4}. With bitmaps of size eight these sets are written as 01110100 and 10101000, respectively. Bitwise operators can now be used to implement some common set theoretical operations. The bitwise AND and OR, for example, allow us to calculate the intersection and union, of two given sets, respectively. The unary bitwise not operator can be used to obtain a set s complement. Some examples are: A B = {2} A B = {0, 2, 3, 4, 5} Ā = {0, 1, 4, 6, 7} 00110100 AND 10101000 = 00100000 00110100 OR 10101000 = 10111100 NOT 00110100 = 11001011 Other operations, such as set difference, can be implemented by combining these basic bitwise operators. 4 Set Partitions with bitmaps Formally, a partition of a given set A is a set {A i } i I, where I is an index set, such that: i I, A i A A = i I A i i, j I, i j, A i A j = We call each A i a block of the partition. If we tried to apply the former ideas to the representation of partitions, some problems would arise. In general, for any finite set S with A = n 0, P(A) = 2 n. A bitmap for a set partition of a set with n elements would have 2 n bits. It would also be extremely sparse, with only n bits actually used. A partition of a set with 32 bits would require a bitmap with 4294967296 bits. This is not a viable solution. However, a partition P of a set S with n elements, can not have more than n blocks. Since each block is a subset of S, it can not hold any more than n elements. If we use bitmaps to represent each of the blocks in P, we will be needing, at most, n bitmaps with n bits each, in a total of n 2 bits. This is a considerably smaller bitmap, and, for the same set of 32 elements of the previous example, only 1024 bits are required. In more practical terms, we are representing a partition not as a set of sets, but as an array of sets. This way, the 3

space needed to represent a partition grows quadratic and not exponential. The following example illustrates this approach. Let us consider the set S = {0, 2, 3, 4} and a partition P = {{0}, {2, 3}, {4}}. The bitmap 10111, with five bits, is enough to represent S. The naive approach we first described uses a bitmap B, with 2 5 bits, to represent P, B = 10011000000000000000000000000000. Of the whole thirty two bits, only the first, the fourth and the fifth are active. Instead, if we use the proposed alternative, we need only five bitmaps with five bits each, which will require a total of twenty five bits. B 0 = 10000 B 1 = 00110 B 2 = 00001 B 3 = 00000 B 4 = 00000 Each one of these bitmaps B i represents a block of the partition P and can be written as an array. A = [0000, 00110, 00001, 00000, 00000] This representation allows us to avoid dealing with a huge, sparse, bitmap while making use of all the advantages of the integer set representation we described earlier. 5 Implementation When efficiency is a concern, representing set partitions with bitmap arrays is a solution that actually solves some problems in a simple and elegant way. There are, however, some set theoretical operations that require special attention. Finding out if some set S is a block of a given partition P, for example, is no longer a matter of performing a simple bitwise AND between P and S except, of course, if S has the same size as P, but this would imply a waste of space that we wish to avoid. In a more general way, operations that need to search the partition (such as block removal or existence testing) can be quite expensive in computational terms. We propose a solution based on AVL trees. These are binary search trees on which the lookup, insertion and deletion operations take O(log n) time. 5.1 AVL trees An AVL tree (or height-balanced tree) [Knu98] is a self-balancing binary search tree. The balance factor of a node is the difference between the height of its subtrees. An AVL tree is considered balanced if every node has a balance factor of, at most, one. A node with any other balance factor is considered unbalanced and requires re-balancing the tree. In a balanced AVL tree, searching, inserting and removing nodes are operations that require O(log n) time (where n is the number of nodes in the tree) on both the average and the worst case. Clearly, the insertion or removal of nodes may force the tree to be rebalanced. In a general way, this is a very efficient data structure that actually outperforms red-black trees or splay trees in search intensive applications [Pfa04]. 4

a) b) Figure 1: a) example of an unbalanced tree; b) the same tree after re-balancing (now an AVL) 5.2 Bitmap arrays with AVL trees In Section 4 we proposed a representation of set partitions based on bitmap arrays. It is quite easy to have a total order over the elements of such an array. From the definition of set partition, we know that the intersection of any two blocks must result in the empty set and that every block contains at least one element. This allows us to order the blocks, for example, by comparing the first element of each block. Since we are using sets of non-negative integers, the usual relation can be used. Having a total order defined, we can now use an AVL tree to represent the array of bitmaps. This way, all the operations necessary to the implementation of the usual set theoretical functions (searching, inserting and removing elements) will take O(log n) time. In an unordered array (or linked list for that matter) search alone is an O(n) operation. 5.3 The library Using the methods described in the previous sections, we have written a small library to handle non-negative integer sets. This library, zset, is written in the C programming language and implements the usual set theoretical operations. Because of the need to deal with arbitrarily big bitmaps, which may exceed the size of a C integer (be it 32 or 64 bits), we used the GMP library. GMP is a free library for arbitrary precision arithmetic. It has a rich set of functions, with a regular interface and is designed to be as fast as possible, both for small and huge operands. We used GMP s multiple precision integers (mpz t type) to represent bitmaps. The logical and bit manipulation functions were used instead of the bitwise operators to implement all set theoretical functions. As for the set partitions, and given the complexity of the algorithms involved, mainly the ones required to maintain balance, we used a small implementation of AVL trees for the C programming language [avl]. This library is also freely available under the LGPL license. Some of the functions are simple enough to be implemented as macros. While at first this may look a good idea for efficiency concerns, it is not. Because the preprocessor replaces 5

every macro call with its definition, macros are unknown to the debugger. This feature not only makes the debugging process a lot more complicated but it also renders profiling impossible. Instead of macros, we used the inline keyword in all function declarations. This is part of the ISO C99 standard and instructs the compiler to integrate the function s code into the code for its callers. It is just as fast as a macro [StGDC05] but allows the debugger to follow the function calls, thus enabling both debug and profiling. A API We now present the API for zset, the library described in the previous section. Module set Data types typedef unsigned long int zset elm t; set element data type typedef avl tree t * partition t; set partition data type typedef struct zset { mpz t s; unsigned long int n; } *zset t; set data type Macros #ifdef INLINE #define INLINE DECL extern inline #else #define INLINE DECL static #endif Functions INLINE DECL zset t zsetinit(unsigned long n) initializes a set with n elements INLINE DECL void zsetdestroy(zset t s) frees the memory space allocated for set s INLINE DECL void zsetadd(zset t s, zset elm t e) adds element e to set s INLINE DECL void zsetdel(zset t s, zset elm t e) removes element e from set s INLINE DECL void zsetintersection(zset t s, zset t s1, zset t s2) s = s 1 s 2 INLINE DECL void zsetunion(zset t s, zset t s1, zset t s2) s = s 1 s 2 INLINE DECL void zsetdifference(zset t s, zset t s1, zset t s2) s = s 1 s 2 INLINE DECL int zsethaselm(zset t s, zset elm t e) returns 0 if e / s INLINE DECL int zsetissubzset(zset t s1, zset t s2) returns 0 if s 1 s 2 INLINE DECL unsigned long int zsetcardinal(zset t s) returns the number of elements in s INLINE DECL partition t partitioninit(void) initializes a partition INLINE DECL void partitiondestroy(partition t p) frees the memory space allocated for partition p INLINE DECL void partitionadd(partition t p, zset t s) 6

References adds set s to partition p INLINE DECL void partitiondel(partition t p, zset t s) remove set s from partition p INLINE DECL int partitioncardinal(partition t p) returns the number of blocks in partition p [AMR06] [avl] [fad] [Knu98] Marco Almeida, Nelma Moreira, and Rogério Reis. Aspects of enumeration and generation with a string automata representation. In 8th International Workshop on Descriptional Complexity of Formal Systems (DCFS), 2006. Gnu libavl, binary search trees library. http://www.stanford.edu/ blp/avl/. FAdo: tools for formal languages manipulation. http://www.ncc.up.pt/fado. Donald E. Knuth. The Art of Computer Programming. Searching and Sorting., volume 3. AW, 1998. [Pfa04] B. Pfaff. Performance analysis of bsts in system software. SIGMET- RICS/Performance poster, June 2004. [StGDC05] Richard M. Stallman and the GCC Developer Community. Using the GNU Compiler Collection (GCC). Free Software Foundation, 2005. Version 4.0.2. 7