Data Structures for Big Data: Bloom Filter. Vinicius Vielmo Cogo Smalltalks, DI, FC/UL. October 16, 2014.



Big Data
- is relative
- is not defined by a specific number of TB, PB, or EB
- is when it becomes big for you
- is when your solutions become inefficient or impractical

Data Structures for Big Data
- Traditional DSs are subject to the same problems (e.g., lists, trees or ...)
- ... (e.g., YARN, NoSQL) ... (e.g., index, metadata)
- We have reached the point of thinking about new DSs for Big Data (BD)

Outline
- Bloom Filter
- Use Cases
- Implementations
- Other Filters
- Other Data Structures for Big Data

Membership testing: does my collection contain this element?

Example: building a Bloom filter for a collection of cities (Coimbra, Leiria). Source: http://billmill.org/bloomfilter-tutorial/

1. The filter bf starts as a bitmap of 15 bits (indexes 0 to 14), all set to 0.
2. Each element goes through two hash functions, Fnv and Murmur.
3. Coimbra hashes to i=4 and i=7, so bf[4] and bf[7] are set to 1.
4. Leiria hashes to i=2 and i=9, so bf[2] and bf[9] are set to 1.

Resulting bitmap:

Index i:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
bf[i]:    0  0  1  0  1  0  0  1  0  1  0  0  0  0  0
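The insertion steps (Coimbra to indexes 4 and 7, Leiria to indexes 2 and 9) can be replayed in a few lines of Python. This is only a minimal sketch: the `SLIDE_INDEXES` table and the `insert` helper are hypothetical names, and the table hard-codes the indexes shown in the slides instead of computing real Fnv or Murmur hashes.

```python
# Indexes copied from the slides. A real filter would compute them with
# hash functions such as Fnv and Murmur (this table is a stand-in).
SLIDE_INDEXES = {
    "Coimbra": (4, 7),
    "Leiria": (2, 9),
}

BITMAP_SIZE = 15  # indexes 0..14, as in the slides


def insert(bf, element):
    """Set the bit at every index the element hashes to."""
    for i in SLIDE_INDEXES[element]:
        bf[i] = 1


bf = [0] * BITMAP_SIZE
for city in ("Coimbra", "Leiria"):
    insert(bf, city)

print(bf)
```

After both insertions only bits 2, 4, 7 and 9 are set, which matches the final bitmap in the slides.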

Querying the filter for the cities Braga, Guarda, Coimbra and Lisboa, against the bitmap with bits 2, 4, 7 and 9 set:

City     Hash indexes   Result
Braga    i=10, i=14     false
Guarda   i=2, i=12      false
Coimbra  i=4, i=7       true
Lisboa   i=7, i=9       true (but it is a false positive)
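The four lookups can be checked with a short Python sketch. As an assumption for illustration, the `SLIDE_INDEXES` table again hard-codes the indexes from the slides rather than computing Fnv or Murmur, and `might_contain` is a hypothetical helper name.

```python
# Indexes copied from the slides (stand-in for real hash functions).
SLIDE_INDEXES = {
    "Braga": (10, 14),
    "Guarda": (2, 12),
    "Coimbra": (4, 7),
    "Lisboa": (7, 9),
}

# Bitmap after inserting Coimbra and Leiria: bits 2, 4, 7 and 9 are set.
bf = [0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]


def might_contain(bf, element):
    """Return True only if every index for the element is set."""
    return all(bf[i] == 1 for i in SLIDE_INDEXES[element])


results = {city: might_contain(bf, city) for city in SLIDE_INDEXES}
print(results)
```

Guarda is rejected because one unset bit (index 12) is enough for a definite "no". Lisboa is accepted even though it was never inserted, because its two bits were set by Coimbra (7) and Leiria (9).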

Data structure (DS) proposed by Burton Howard Bloom in 1970
Design principles:
- Space-efficient: smaller than the original dataset
- Time-efficient: low-latency reads and writes in O(k), where k is the number of hash functions, which is much smaller than O(n); high throughput
- Probabilistic: e.g., mycollection.mightcontain(myobject); false positives happen (but in a configurable way)
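To make the O(k) reads and writes concrete, here is a minimal Bloom filter sketch in Python. It is not the implementation used in the talk: the class and method names are illustrative, and the k indexes are derived from a SHA-256 digest by double hashing, an assumption made here in place of Fnv and Murmur.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _indexes(self, item):
        # Assumption: derive k indexes from two halves of a SHA-256
        # digest (double hashing) instead of using Fnv and Murmur.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):  # k writes
            self.bits[i] = 1

    def might_contain(self, item):
        # k reads; False is definite, True may be a false positive.
        return all(self.bits[i] for i in self._indexes(item))


bf = BloomFilter(m=1024, k=3)
for city in ("Coimbra", "Leiria"):
    bf.add(city)

print(bf.might_contain("Coimbra"))
print(bf.might_contain("Braga"))
```

An inserted element is always reported as present. An element that was never inserted is usually reported as absent, but a false positive remains possible, which is why the method is named "might contain".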

Important variables
- n = expected collection size (e.g., the cities Coimbra and Leiria)
- p = false positive rate (e.g., 0.0001%, or 1 in 1M)
- m = bitmap size (e.g., the 15 bits at indexes 0 to 14)
- k = optimal number of hash functions (e.g., Fnv and Murmur)

Important variables [slide: equations relating the four variables]

Users define two of them (normally n and any other). The other two are calculated with those equations.
Interesting relations:
- Bigger collection -> larger bitmap
- Bigger collection -> more false positives
- Larger bitmap -> fewer false positives
- Larger bitmap -> fewer hash functions
- Fewer hash functions -> ...
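The widely used sizing formulas for a Bloom filter are m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2, with the false positive rate approximated by p = (1 - e^(-kn/m))^k. Assuming these are the equations the slides refer to, and using p for the false positive rate and m for the bitmap size (symbols chosen here, not confirmed by the transcript), a small Python sketch can compute the two derived values from n and p. The function names are illustrative.

```python
import math


def optimal_size(n, p):
    """Return (m, k) for n expected elements and false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k


def false_positive_rate(n, m, k):
    """Approximate false positive rate after inserting n elements."""
    return (1 - math.exp(-k * n / m)) ** k


# Example: one million elements with a 1% false positive target.
m, k = optimal_size(n=1_000_000, p=0.01)
print(m, k)
print(false_positive_rate(1_000_000, m, k))
```

With these formulas, doubling n at a fixed p requires a larger m, inserting more elements into a fixed bitmap raises the false positive rate, and enlarging the bitmap lowers it, in line with the first three relations listed above.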

[Figure: Bloom filter size vs. false positive rate]

Use Cases: reducing unnecessary disk reads
The client asks the Bloom filter (in RAM) before reading the dataset (on hard disk):

Query  BloomFilter  Disk read             Dataset  Answer
1?     F            none                  F        No
2?     T            necessary read(2)     T        2
3?     T            unnecessary read(3)   F        No
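The three requests in this use case can be simulated with a short Python sketch. Everything here is a stand-in chosen for illustration: `GuardedStore` is a hypothetical class, a dict plays the hard disk, and a plain set plays the Bloom filter's answers, with key 3 added on purpose to mimic the false positive.

```python
class GuardedStore:
    """Dataset on 'disk' fronted by an in-memory membership filter."""

    def __init__(self, disk, maybe_present):
        self.disk = disk                    # dict standing in for the hard disk
        self.maybe_present = maybe_present  # set standing in for the Bloom filter
        self.disk_reads = 0

    def get(self, key):
        if key not in self.maybe_present:
            return None          # filter says F: skip the disk entirely
        self.disk_reads += 1     # filter says T: the disk must be read
        return self.disk.get(key)


# Key 2 is stored; key 3 plays the role of a false positive.
store = GuardedStore(disk={2: "value-2"}, maybe_present={2, 3})
answers = [store.get(key) for key in (1, 2, 3)]
print(answers, store.disk_reads)
```

Only two of the three requests reach the disk: the read for key 2 is necessary, the read for key 3 is the cost of the false positive, and the request for key 1 is answered from RAM alone.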

Use Cases
- Google BigTable, Apache Cassandra and HBase: reducing disk lookups
- Google Chrome: looking up a list of known malicious URLs
- Bitcoin: getting only the transactions relevant to your wallet
- Others: in my Ph.D. work, looking up a list of known privacy-sensitive DNA sequences

Implementations
- guava-libraries: https://code.google.com/p/guava-libraries/
- Orestes-Bloomfilter: https://github.com/baqend/orestes-bloomfilter
- java-bloomfilter: https://github.com/magnuss/java-bloomfilter
- java-longfastbloomfilter: https://code.google.com/p/java-longfastbloomfilter/

Other Filters
- Counting Bloom filters: allow deletions (use a 4-bit counter instead of 1 bit)
- Buffered Bloom filters: sub-filters in SSD with buffered R/W exploiting bit locality
- Quotient and Cascade filters: use an SSD, instead of the main memory, for scalability
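As a rough illustration of how a counting Bloom filter supports deletions, the Python sketch below replaces each bit with a small counter capped at 15, the largest value a 4-bit counter can hold. The class name, the SHA-256 double hashing and the saturation policy are assumptions made for this example, not details taken from the talk.

```python
import hashlib


class CountingBloomFilter:
    MAX_COUNT = 15  # largest value a 4-bit counter can hold

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _indexes(self, item):
        # Assumption: double hashing over a SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):
            if self.counters[i] < self.MAX_COUNT:  # saturate, never overflow
                self.counters[i] += 1

    def remove(self, item):
        # Only call this for items that were added before; otherwise
        # counters belonging to other items would be decremented.
        for i in self._indexes(item):
            if 0 < self.counters[i] < self.MAX_COUNT:
                self.counters[i] -= 1

    def might_contain(self, item):
        return all(self.counters[i] > 0 for i in self._indexes(item))


cbf = CountingBloomFilter(m=64, k=3)
cbf.add("Coimbra")
cbf.add("Leiria")
cbf.remove("Leiria")
print(cbf.might_contain("Coimbra"))
```

Removing Leiria only decrements counters that Leiria itself incremented, so Coimbra is still reported as present. A counter that has saturated at 15 is left untouched on removal, which avoids introducing false negatives at the price of never clearing that slot.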

Other DSs (and techniques) for Big Data
- Locality-sensitive hashing (LSH): hashing similar elements into the same bucket with high probability
- HyperLogLog for computing cardinality: estimating the number of distinct elements in a collection
- Log-Structured Merge (LSM) trees: indexed access to files with high insert volume and background batch synchronization

Thank you!