Data Structures for Big Data: Bloom Filter. Vinicius Vielmo Cogo. Smalltalks, DI, FC/UL. October 16, 2014.


Big Data
- is relative
- is not defined by a specific number of TB, PB, EB
- is when it becomes big for you
- is when your solutions become inefficient/impractical

Data Structures for Big Data
- Traditional DSs are subject to the same problems (e.g., lists, trees)
- [...] or [...] (e.g., YARN, NoSQL)
- [...] (e.g., index, metadata)
- We have reached the point of thinking about new DSs for BD

Outline
- Bloom Filter
- Use Cases
- Implementations
- Other Filters
- Other Data Structures for Big Data

Membership testing: Does my collection contain this element?

City: Coimbra, Leiria

Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 0 0 0 0 0 0 0 0  0  0  0  0  0
Source: http://billmill.org/bloomfilter-tutorial/

City: Coimbra, Leiria
Hash functions: Fnv, Murmur
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 0 0 0 0 0 0 0 0  0  0  0  0  0

City: Coimbra, Leiria
Inserting: Coimbra
Hash functions: Fnv -> i=4, Murmur -> i=7
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 0 0 0 0 0 0 0 0  0  0  0  0  0

City: Braga, Guarda, Coimbra, Lisboa

City: Braga, Guarda, Coimbra, Lisboa
Query: Braga
Hash functions: Fnv -> i=10, Murmur -> i=14
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 1 0 1 0 0 1 0 1  0  0  0  0  0
Result: false

City: Braga, Guarda, Coimbra, Lisboa
Query: Guarda
Hash functions: Fnv -> i=2, Murmur -> i=12
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 1 0 1 0 0 1 0 1  0  0  0  0  0
Result: false

City: Braga, Guarda, Coimbra, Lisboa
Query: Coimbra
Hash functions: Fnv -> i=4, Murmur -> i=7
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 1 0 1 0 0 1 0 1  0  0  0  0  0
Result: true

City: Braga, Guarda, Coimbra, Lisboa
Query: Lisboa
Hash functions: Fnv -> i=7, Murmur -> i=9
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 1 0 1 0 0 1 0 1  0  0  0  0  0
Result: true (but it is a false positive)
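The insert-and-query walkthrough above can be replayed in a few lines. This is only a toy sketch: the hash outputs are hard-coded from the slides rather than computed with real Fnv or Murmur implementations, and the index pair for Leiria (2, 9) is an assumption inferred from the final bitmap, since the slides that insert it are not in the source. The names SLIDE_INDEXES, insert and might_contain are hypothetical.

```python
# Toy replay of the slides' example. Hash outputs are hard-coded, not computed.
BITMAP_SIZE = 15

# (Fnv index, Murmur index) per city, as shown on the slides.
SLIDE_INDEXES = {
    "Coimbra": (4, 7),
    "Leiria": (2, 9),    # assumption: inferred from the final bitmap
    "Braga": (10, 14),
    "Guarda": (2, 12),
    "Lisboa": (7, 9),
}


def insert(bf, city):
    """Set the bit at every index the city hashes to."""
    for i in SLIDE_INDEXES[city]:
        bf[i] = 1


def might_contain(bf, city):
    """True only if every bit the city hashes to is set."""
    return all(bf[i] == 1 for i in SLIDE_INDEXES[city])


bf = [0] * BITMAP_SIZE
for city in ("Coimbra", "Leiria"):
    insert(bf, city)

results = {
    city: might_contain(bf, city)
    for city in ("Braga", "Guarda", "Coimbra", "Lisboa")
}
print(bf)
print(results)
```

Lisboa was never inserted, yet both of its bits (7 and 9) were set by other cities, which is exactly how a false positive arises.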

DS proposed by Burton Howard Bloom in 1970
Design principles
- Space-efficient
  - Smaller than the original dataset
- Time-efficient
  - Low latency R/W
  - O(k) for k hash functions, which is much smaller than O(n) for a collection of n elements
  - High throughput
- Probabilistic
  - E.g., mycollection.mightcontain(myobject)
  - False positives happen (but in a configurable way)
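A minimal sketch of these principles, assuming string elements and using SHA-256 with double hashing to derive the k indexes (a substitution for the Fnv and Murmur functions named in the slides, chosen only because it is in the Python standard library). The class name BloomFilter and the method names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Bitmap of m bits probed by k derived hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits packed into bytes

    def _indexes(self, item):
        # Double hashing: index_i = (h1 + i * h2) mod m, both taken from one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd, so never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # O(k) write: set one bit per hash function.
        for i in self._indexes(item):
            self.bits[i >> 3] |= 1 << (i & 7)

    def might_contain(self, item):
        # O(k) read: False is definite, True may be a false positive.
        return all(self.bits[i >> 3] & (1 << (i & 7)) for i in self._indexes(item))


cities = BloomFilter(m=1024, k=3)
for name in ("Coimbra", "Leiria"):
    cities.add(name)
print(cities.might_contain("Coimbra"))
```

Both operations touch only k bits regardless of how many elements were added, and the filter stores no element, which is where the space and time savings come from.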

Important variables
- n = Expected collection size (e.g., City: Coimbra, Leiria)
- p = False positive rate (e.g., 0.0001% or 1 in 1M)
- m = Bitmap size (e.g., 15 bits, indexes 0 to 14)
- k = Optimal number of hash functions (e.g., Fnv, Murmur)

Important variables
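The standard sizing relations for a Bloom filter, which are presumably the ones meant here, link the expected collection size n, the false positive rate p, the bitmap size m and the optimal number of hash functions k. The symbols p, m and k are assumed names following common usage.

```latex
m = -\frac{n \ln p}{(\ln 2)^2}
\qquad
k = \frac{m}{n} \ln 2
```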

Users define two of them (normally n and any other)
The other two are calculated with those equations
Interesting relations:
- Bigger collection (n) -> Larger bitmap (m)
- Bigger collection (n) -> More false positives (p)
- Larger bitmap (m) -> Fewer false positives (p)
- Larger bitmap (m) -> Fewer hash functions (k)
- [...] -> Fewer hash functions (k)
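As a sketch of how the remaining two variables can be derived when the user fixes n and the false positive rate, the function below applies the standard sizing formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. The function name optimal_size and the symbol p are assumptions, not taken from the slides.

```python
import math


def optimal_size(n, p):
    """Return (m, k): bitmap size in bits and number of hash functions
    for n expected elements and a target false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


m, k = optimal_size(n=1_000_000, p=0.01)
print(m, k)
```

For the rate used as an example earlier (1 in 1M, p = 1e-6), the same formulas give roughly 28.8 bits per element and k = 20.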

[Figure: Bloom filter size vs. false positive rate]
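The trade-off between size and false positive rate can be tabulated with the usual approximation p ≈ (1 - e^(-kn/m))^k. The sketch below is not a reproduction of the original plot; the function names and the chosen bits-per-element values are assumptions for illustration.

```python
import math


def false_positive_rate(m, n, k):
    """Approximate false positive rate of an m-bit filter holding n elements with k hashes."""
    return (1.0 - math.exp(-k * n / m)) ** k


def rate_for_bits_per_element(b):
    """Rate when spending b bits per element with the rounded optimal k."""
    k = max(1, round(b * math.log(2)))
    return false_positive_rate(m=b, n=1, k=k)


rates = [rate_for_bits_per_element(b) for b in (4, 8, 16, 32)]
for b, rate in zip((4, 8, 16, 32), rates):
    print(f"{b:2d} bits/element -> p ~ {rate:.2e}")
```

Each doubling of the bits per element lowers the rate by orders of magnitude, which is why a filter far smaller than the dataset can still be useful.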

Use Cases: Reducing unnecessary disk reads
Client -> Bloom filter (RAM) -> Dataset (hard disk)
Request  Bloom filter  Disk read             In dataset  Reply
1?       F             none                  F           No
2?       T             read(2), necessary    T           2
3?       T             read(3), unnecessary  F           No
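The read path in this use case can be sketched as follows. A plain dict stands in for the on-disk dataset, and the class GuardedStore, its parameters and the SHA-256 based hashing are all assumptions made for illustration.

```python
import hashlib


class GuardedStore:
    """Consults an in-memory Bloom filter before touching the (simulated) disk."""

    def __init__(self, records, m=1024, k=3):
        self.disk = dict(records)  # stands in for the dataset on the hard disk
        self.m, self.k = m, k
        self.bits = [0] * m        # the filter kept in RAM
        self.disk_reads = 0
        for key in self.disk:
            for i in self._indexes(key):
                self.bits[i] = 1

    def _indexes(self, key):
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def get(self, key):
        if not all(self.bits[i] for i in self._indexes(key)):
            return None            # filter says F: answer "No" without a disk read
        self.disk_reads += 1       # filter says T: the read may still be unnecessary
        return self.disk.get(key)


store = GuardedStore({2: "two"})
hit = store.get(2)
misses = [store.get(key) for key in range(100, 200)]
print(hit, store.disk_reads)
```

Only false positives reach the disk for absent keys, so the number of wasted reads is bounded by the configured false positive rate.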

Use Cases
- Google BigTable, Apache Cassandra and HBase: reducing disk lookups
- Google Chrome: look up a list of known malicious URLs
- Bitcoin: get only the transactions relevant to your wallet
- Others
- In my Ph.D. work: look up a list of known privacy-sensitive DNA sequences

Implementations
- guava-libraries: https://code.google.com/p/guava-libraries/
- Orestes-Bloomfilter: https://github.com/baqend/orestes-bloomfilter
- java-bloomfilter: https://github.com/magnuss/java-bloomfilter
- java-longfastbloomfilter: https://code.google.com/p/java-longfastbloomfilter/

Other Filters
- Counting Bloom filters: allow deletions (use a 4-bit counter instead of 1 bit)
- Buffered Bloom filters: sub-filters in SSD with buffered R/W, exploiting bit locality
- Quotient and Cascade filters: use an SSD, instead of the main memory, for scalability
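A minimal sketch of the counting variant, assuming string elements and counters that saturate at 15 (the largest 4-bit value). For clarity the counters live in a plain Python list rather than packed 4-bit fields, and the class and method names are hypothetical.

```python
import hashlib


class CountingBloomFilter:
    MAX_COUNT = 15  # largest value a 4-bit counter can hold

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m  # one small counter per position instead of one bit

    def _indexes(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):
            if self.counters[i] < self.MAX_COUNT:
                self.counters[i] += 1  # saturate instead of overflowing

    def might_contain(self, item):
        return all(self.counters[i] > 0 for i in self._indexes(item))

    def remove(self, item):
        """Decrement the item's counters; only safe for items that were added."""
        if not self.might_contain(item):
            return False
        for i in self._indexes(item):
            # Saturated counters are left alone: their true count is unknown.
            if 0 < self.counters[i] < self.MAX_COUNT:
                self.counters[i] -= 1
        return True


cbf = CountingBloomFilter(m=256, k=3)
cbf.add("Coimbra")
cbf.add("Leiria")
removed = cbf.remove("Coimbra")
print(removed, cbf.might_contain("Leiria"))
```

The price of supporting deletions is space: each position costs 4 bits instead of 1.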

Other DSs (and techniques) for Big Data
- Locality-sensitive hashing (LSH): hashing similar elements into the same bucket with high probability
- HyperLogLog for computing cardinality: counting the number of distinct elements in a collection
- Log Structured Merge (LSM) trees: indexed access to files with high insert volume and background batch synchronization

Thank you!