# Big Data & Scripting Part II Streaming Algorithms

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Big Data & Scripting Part II Streaming Algorithms 1,

2 Counting Distinct Elements 2,

3 3, counting distinct elements problem formalization input: stream of elements o from some universe U e.g. ids from a set of allowed ids elements can appear arbitrary often, not all of them appear question: how many different elements did appear? application example: counting the number of visitors (IPs) of a website storing the elements is not an option estimate number using hashing and probabilities

4 4, Flajolet-Martin algorithm idea binary function f : X {0, 1}, x i = x j f (x i ) = f (x j ) P(f (x i ) = 1) = p identically, independently distributed for all x i assume we encountered m different x i 1 what is the probability P m, that f (x i ) = 0 for all i? P 1 = (1 p) for one x i, P m = (1 p) m for m different x i P m depends on the number of unique x i, not on the total number exploit this to get an estimate of m 1 that s actually what we try to find out

5 5, Flajolet-Martin algorithm idea (1 p) m = ( ) (1 p) 1 mp p e mp for small p: (1 p) 1 p e 1 (Euler s number) probability P m that f (x i ) = 0 for m different x i P m = e mp if m 1/p: P m 1 if m 1/p: P m 0 p should be small (for derivation and for large m)

6 6, Flajolet-Martin algorithm bitmask & tail length use hash function h : X {0, 1} K e.g. binary representation of hash values assume bits of h(x i ) are uniform, independent distributed the tail idea: count number of trailing zeros tail(h) = max{j : h mod 2 j = 0} bitmask introduce a new random variable bitmask {0, 1} M initialize bitmask(i) = 0 for i = 0... M 1 (zero indexed) for every x i : bitmask(tail(h(x i ))) 1

7 7, Flajolet-Martin algorithm bitmask & tail length bitmask() keeps tracks of tail lengths seen so far used to extend approximate counting by random variables bitmask(0) is expected to be set to 1 by 1/2 of the x i bitmask(1) is expected to be set to 1 by 1/4 of the x i,... bitmask(i) is almost certainly 0 if i log 2 m bitmask(i) is almost certainly 1 if i log 2 m (recall that m is the number of unique elements seen ) for counting, derive random variable R: R(bitmask) = min{i : bitmask(i) = 0} it can be shown that: E(R) log 2 ϕm where ϕ = m can be approximated as m ϕ 1 2 R

8 8, Flajolet-Martin algorithm accuracy the idea so far probability of f (x i ) = 1 depends m use this for estimation for tail length: k longest tail of hash value observed estimate 2 k elements problems/refinements R is not very precise (large standard deviation σ(r) 1.12) precision could be increased by averaging multiple functions but: determining h(x i ) is costly

9 9, Flajolet-Martin algorithm accuray using k different hash functions (and bitmasks) increases accuracy: R = 1/k k j=1 R(bitmask j ) has std. dev. 1.12/ k estimate is exponential of R in large influence on mean value improve accuracy with only one hash function: use k different bitmaps bitmask 0,..., bitmask k 1 for each x i update only bitmask (h(x i ) mod k) ( k 1 ) R = 1/k j=0 R(bitmaskj ) std. dev. 1.12/ k still holds

10 10, Flajolet-Martin algorithm input: stream of numbers X (e.g. converted strings) precision parameter k output: estimate m of unique numbers within X algorithm: bitmask=bitmask[k] // array of bitmasks initialized with 0 while x i = readnext(){ h=h(x i ) update bitmask[h mod k] with tail(h) } mean=mean({r(bitmask[j]) : j = 0,..., k}) return(2 mean / ); source and additional information: Flajolet, P.; Martin, G. N.: Probabilistic counting algorithms for data base applications, 1985

11 11, sampling in streams 2 sample descriptive subset of the original data approximate true distribution problem selection of samples (how?) introduction of bias sampling over time 2 C.f. Book: Mining of Massive Datasets, Rajaraman, Ullman, 2011, available free, online, link on website.

12 12, sampling example We use the following example setting to demonstrate problems in sampling: data and problem incoming data: data points: (user, query, time) e.g. DB system serving clients queries appear repeatedly goal: estimate number of unique 2,3,... times repeated queries

13 13, sampling naive approach (does not work) sample 10% of the data points (randomly) estimate repetitions from resulting sample example: unique and double queries unique queries A and double queries B 1,B 2 P(A in sample) = 0.1, analogous for B 1 and B 2 P(B 1 in sample and B 2 not in sample) = 0.1 (1 0.1) = 0.09 in the sample, these appear as unique queries! P(B 1 and B 2 in sample) = = 0.01

14 14, sampling using hash functions solution: frequency-independent sampling sample items independent of their frequency if item is in sample, retain all copies important assumption: objects are approximately equally distributed to hash values (buckets) algorithm - sample (j/k)n elements hash function h : X {0,..., k 1} if(h(x i ) < j) add x i to sample works, if N is known in advance and j/k is small enough can be adapted to arbitrary N and fixed sample size

15 15, sampling adaptive sample size problem number of distinct objects unknown in the beginning sample should be of fixed or limited size approach: use large number of buckets (e.g. keep all x i ) sample x i as before (store each sample in it s bucket) when sample grows too large, drop buckets from sample

16 16, sampling a concrete example problem: extract 2GB sample from incoming ids input objects with id as string s 0 s 1... s k, k {3,..., 20} s i {a,..., z, A,..., Z} = Var Var = 52 convert id to number: assign value to each character: 3 v(a) = 0, v(b) = 1,..., v(z) = 51 convert to number: number(s 0... s k ) = k i=1 v(s i ) 52 i for implementation, use Horner s method: v 0 = s k, v 1 = v s k 1,..., number(s 0... s k ) = v k+1 apply hash function h(v) = v mod 1000 sample h(v) < s starting with s = 1000 when mem usage of samples > 2GB, k is decremented by 10 3 or use their ASCII/UTF value

17 17, sampling problems characters are often not equally distributed (e.g. in languages) we ignore that for now assume, hashing takes care of that hash values get very large (e.g = ) problem: primitive data types (int, long) are limited counter measure: apply h() to v i in every step works if a = 0 because of modulo arithmetic what if k = 1 and sample is still >2GB? rerun, use more buckets (e.g ) or use second stage of hashing: apply new hash function to selected samples (those passing the first filter only)

18 18, maintaining sets motivating scenario a sample S X can be used to learn properties of the overall distribution example: queries and their frequencies in the following, we derive further statistics about the samples e.g.: frequency distributions (how many queries have certain frequencies) we need to decide, whether x i S we could exploit the filters resulting from the sampling process problem: new (unseen) objects could end up in sample buckets aim: decide set membership without further information

19 19, maintaining sets a first approach use fine grained hashing to split samples into small chunks remember buckets that contain x i S discard elements not in these buckets result: dirty filter if x i S it is retained if x i / S it might be discarded in the following: improve this general idea using hash functions Bloom Filter

20 Bloom filter idea and introduction input: large set S of objects problem: decide whether x i S without storing S formalizing the idea - (not the Bloom filter yet) hash function h : X H, H very large { 1, if s S : h(s) = i bit array B with B[i] = 0, else for each s S B[h(s)] == 1 for each s / S B[h(s)] == 1 (false positive) or B[h(s)] == 0 (reject) elements of S pass, others might pass but hopefully not all next: try to minimize false positives 20,

21 21, Bloom filter definition and usage Bloom filter set S and bit array B as before collection of hash functions: h 1,..., h k mapping to H set B[i] = 1, if for any s S and any h j : h j (s) = i object o passes, if B[h i (o)] = 1 for all i {1,..., k} B[h i (o)] = 0 for any i, o does not pass spreading to more hash functions avoids collisions sufficient, if only one h i does not produce a collision allows derivation of probabilistic properties e.g. how many h i, and size of H for a given S

22 22, Bloom filter: properties and extensions array size A = m (memory usage) e.g k hash functions (time consumption of test) e.g. 5 S = n elements stored (yield in a sense) e.g approximate probability of false positive: ( 1 e kn/m) k ( example: 1 e /10 6) (0.94%) mem usage: 10 6 bits 122 kbyte (10 bits/id) assume id has 10 bytes: 10 7 bytes 9.5 MB conclusions: more memory improves probability, lesser elements (n), too more hash functions not necessarily! e.g. (values from above) k = 7 prob 0.82%, k = 10 prob 1% each m, n pair has optimal k = n m ln(2)

23 23, wrap up: sampling and filtering straightforward sampling can go wrong correct sample by fixing selection e.g. limit to certain users/ids decide if in sample or not with hash function does that solve our count duplicates problem? filter large set of objects with Bloom filter

24 24, insertion: difference between streaming and online algorithms: online scenario: unlimited stream one pass answers ready all the time (e.g. sliding window) streaming scenario several passes possible, usually avoided answers usually at end of execution

### Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),

### Big Data & Scripting storage networks and distributed file systems

Big Data & Scripting storage networks and distributed file systems 1, 2, in the remainder we use networks of computing nodes to enable computations on even larger datasets for a computation, each node

### Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman To motivate the Bloom-filter idea, consider a web crawler. It keeps, centrally, a list of all the URL s it has found so far. It

### Mining Data Streams. Chapter 4. 4.1 The Stream Data Model

Chapter 4 Mining Data Streams Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it. In this chapter, we shall make

### MODELING RANDOMNESS IN NETWORK TRAFFIC

MODELING RANDOMNESS IN NETWORK TRAFFIC - LAVANYA JOSE, INDEPENDENT WORK FALL 11 ADVISED BY PROF. MOSES CHARIKAR ABSTRACT. Sketches are randomized data structures that allow one to record properties of

### Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.

Universal hashing No matter how we choose our hash function, it is always possible to devise a set of keys that will hash to the same slot, making the hash scheme perform poorly. To circumvent this, we

### Chapter Objectives. Chapter 9. Sequential Search. Search Algorithms. Search Algorithms. Binary Search

Chapter Objectives Chapter 9 Search Algorithms Data Structures Using C++ 1 Learn the various search algorithms Explore how to implement the sequential and binary search algorithms Discover how the sequential

### Hierarchical Bloom Filters: Accelerating Flow Queries and Analysis

Hierarchical Bloom Filters: Accelerating Flow Queries and Analysis January 8, 2008 FloCon 2008 Chris Roblee, P. O. Box 808, Livermore, CA 94551 This work performed under the auspices of the U.S. Department

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

### A Time Efficient Algorithm for Web Log Analysis

A Time Efficient Algorithm for Web Log Analysis Santosh Shakya Anju Singh Divakar Singh Student [M.Tech.6 th sem (CSE)] Asst.Proff, Dept. of CSE BU HOD (CSE), BUIT, BUIT,BU Bhopal Barkatullah University,

### Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

### Theoretical Aspects of Storage Systems Autumn 2009

Theoretical Aspects of Storage Systems Autumn 2009 Chapter 3: Data Deduplication André Brinkmann News Outline Data Deduplication Compare-by-hash strategies Delta-encoding based strategies Measurements

### 1 Digital Signatures. 1.1 The RSA Function: The eth Power Map on Z n. Crypto: Primitives and Protocols Lecture 6.

1 Digital Signatures A digital signature is a fundamental cryptographic primitive, technologically equivalent to a handwritten signature. In many applications, digital signatures are used as building blocks

### To convert an arbitrary power of 2 into its English equivalent, remember the rules of exponential arithmetic:

Binary Numbers In computer science we deal almost exclusively with binary numbers. it will be very helpful to memorize some binary constants and their decimal and English equivalents. By English equivalents

### Packet forwarding using improved Bloom filters

Packet forwarding using improved Bloom filters Thomas Zink thomas.zink@uni-konstanz.de A Master Thesis submitted to the Department of Computer and Information Science University of Konstanz in fulfillment

### Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007 Naïve Bayes Components ML vs. MAP Benefits Feature Preparation Filtering Decay Extended Examples

### CS 2112 Spring 2014. 0 Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

CS 2112 Spring 2014 Assignment 3 Data Structures and Web Filtering Due: March 4, 2014 11:59 PM Implementing spam blacklists and web filters requires matching candidate domain names and URLs very rapidly

### Primitive Data Types Summer 2010 Margaret Reid-Miller

Primitive Data Types 15-110 Summer 2010 Margaret Reid-Miller Data Types Data stored in memory is a string of bits (0 or 1). What does 1000010 mean? 66? 'B'? 9.2E-44? How the computer interprets the string

### Big Data and Scripting

Big Data and Scripting 1, 2, Big Data and Scripting - abstract/organization contents introduction to Big Data and involved techniques schedule 2 lectures (Mon 1:30 pm, M628 and Thu 10 am F420) 2 tutorials

### Enhancing Collaborative Spam Detection with Bloom Filters

Enhancing Collaborative Spam Detection with Bloom Filters Jeff Yan Pook Leong Cho Newcastle University, School of Computing Science, UK {jeff.yan, p.l.cho}@ncl.ac.uk Abstract Signature-based collaborative

### COMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012

Binary numbers The reason humans represent numbers using decimal (the ten digits from 0,1,... 9) is that we have ten fingers. There is no other reason than that. There is nothing special otherwise about

### Data Reduction: Deduplication and Compression. Danny Harnik IBM Haifa Research Labs

Data Reduction: Deduplication and Compression Danny Harnik IBM Haifa Research Labs Motivation Reducing the amount of data is a desirable goal Data reduction: an attempt to compress the huge amounts of

### Software-Defined Traffic Measurement with OpenSketch

Software-Defined Traffic Measurement with OpenSketch Lavanya Jose Stanford University Joint work with Minlan Yu and Rui Miao at USC 1 1 Management is Control + Measurement control - Access Control - Routing

### CHAPTER 1 Overview of SAS/ACCESS Interface to Relational Databases

3 CHAPTER 1 Overview of SAS/ACCESS Interface to Relational Databases About This Document 3 Methods for Accessing Relational Database Data 4 Selecting a SAS/ACCESS Method 4 Methods for Accessing DBMS Tables

### Secure Computation Martin Beck

Institute of Systems Architecture, Chair of Privacy and Data Security Secure Computation Martin Beck Dresden, 05.02.2015 Index Homomorphic Encryption The Cloud problem (overview & example) System properties

### Subnetting Examples. There are three types of subnetting examples I will show in this document:

Subnetting Examples There are three types of subnetting examples I will show in this document: 1) Subnetting when given a required number of networks 2) Subnetting when given a required number of clients

### More Data Mining with Weka

More Data Mining with Weka Class 1 Lesson 1 Introduction Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz More Data Mining with Weka a practical course

### Innovative Techniques and Tools to Detect Data Quality Problems

Paper DM05 Innovative Techniques and Tools to Detect Data Quality Problems Hong Qi and Allan Glaser Merck & Co., Inc., Upper Gwynnedd, PA ABSTRACT High quality data are essential for accurate and meaningful

### Probabilistic Deduplication for Cluster-Based Storage Systems

Probabilistic Deduplication for Cluster-Based Storage Systems Davide Frey, Anne-Marie Kermarrec, Konstantinos Kloudas INRIA Rennes, France Motivation Volume of data stored increases exponentially. Provided

### New Hash Function Construction for Textual and Geometric Data Retrieval

Latest Trends on Computers, Vol., pp.483-489, ISBN 978-96-474-3-4, ISSN 79-45, CSCC conference, Corfu, Greece, New Hash Function Construction for Textual and Geometric Data Retrieval Václav Skala, Jan

### SAS Data Set Encryption Options

Technical Paper SAS Data Set Encryption Options SAS product interaction with encrypted data storage Table of Contents Introduction: What Is Encryption?... 1 Test Configuration... 1 Data... 1 Code... 2

### Load Balancing in MapReduce Based on Scalable Cardinality Estimates

Load Balancing in MapReduce Based on Scalable Cardinality Estimates Benjamin Gufler 1, Nikolaus Augsten #, Angelika Reiser 3, Alfons Kemper 4 Technische Universität München Boltzmannstraße 3, 85748 Garching

### 16.1 MAPREDUCE. For personal use only, not for distribution. 333

For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

### COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt

### Big Data Analytics. Genoveva Vargas-Solar http://www.vargas-solar.com/big-data-analytics French Council of Scientific Research, LIG & LAFMIA Labs

1 Big Data Analytics Genoveva Vargas-Solar http://www.vargas-solar.com/big-data-analytics French Council of Scientific Research, LIG & LAFMIA Labs Montevideo, 22 nd November 4 th December, 2015 INFORMATIQUE

### A First Book of C++ Chapter 2 Data Types, Declarations, and Displays

A First Book of C++ Chapter 2 Data Types, Declarations, and Displays Objectives In this chapter, you will learn about: Data Types Arithmetic Operators Variables and Declarations Common Programming Errors

### Algorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)

Algorithmic Aspects of Big Data Nikhil Bansal (TU Eindhoven) Algorithm design Algorithm: Set of steps to solve a problem (by a computer) Studied since 1950 s. Given a problem: Find (i) best solution (ii)

### Physical Data Organization

Physical Data Organization Database design using logical model of the database - appropriate level for users to focus on - user independence from implementation details Performance - other major factor

### Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

### Data Structures for Big Data: Bloom Filter. Vinicius Vielmo Cogo Smalltalks, DI, FC/UL. October 16, 2014.

Data Structures for Big Data: Bloom Filter Vinicius Vielmo Cogo Smalltalks, DI, FC/UL. October 16, 2014. is relative is not defined by a specific number of TB, PB, EB is when it becomes big for you is

### File System Management

Lecture 7: Storage Management File System Management Contents Non volatile memory Tape, HDD, SSD Files & File System Interface Directories & their Organization File System Implementation Disk Space Allocation

### Distance Degree Sequences for Network Analysis

Universität Konstanz Computer & Information Science Algorithmics Group 15 Mar 2005 based on Palmer, Gibbons, and Faloutsos: ANF A Fast and Scalable Tool for Data Mining in Massive Graphs, SIGKDD 02. Motivation

### Unit 4.3 - Storage Structures 1. Storage Structures. Unit 4.3

Storage Structures Unit 4.3 Unit 4.3 - Storage Structures 1 The Physical Store Storage Capacity Medium Transfer Rate Seek Time Main Memory 800 MB/s 500 MB Instant Hard Drive 10 MB/s 120 GB 10 ms CD-ROM

### Bloom Filters. Christian Antognini Trivadis AG Zürich, Switzerland

Bloom Filters Christian Antognini Trivadis AG Zürich, Switzerland Oracle Database uses bloom filters in various situations. Unfortunately, no information about their usage is available in Oracle documentation.

### SQL Server An Overview

SQL Server An Overview SQL Server Microsoft SQL Server is designed to work effectively in a number of environments: As a two-tier or multi-tier client/server database system As a desktop database system

### Kafka & Redis for Big Data Solutions

Kafka & Redis for Big Data Solutions Christopher Curtin Head of Technical Research @ChrisCurtin About Me 25+ years in technology Head of Technical Research at Silverpop, an IBM Company (14 + years at Silverpop)

### Counting Problems in Flash Storage Design

Flash Talk Counting Problems in Flash Storage Design Bongki Moon Department of Computer Science University of Arizona Tucson, AZ 85721, U.S.A. bkmoon@cs.arizona.edu NVRAMOS 09, Jeju, Korea, October 2009-1-

### The Fundamentals of C++

The Fundamentals of C++ Basic programming elements and concepts JPC and JWD 2002 McGraw-Hill, Inc. Program Organization Program statement Definition Declaration Action Executable unit Named set of program

### MAD2: A Scalable High-Throughput Exact Deduplication Approach for Network Backup Services

MAD2: A Scalable High-Throughput Exact Deduplication Approach for Network Backup Services Jiansheng Wei, Hong Jiang, Ke Zhou, Dan Feng School of Computer, Huazhong University of Science and Technology,

### Lecture 7: Hashing III: Open Addressing

Lecture 7: Hashing III: Open Addressing Lecture Overview Open Addressing, Probing Strategies Uniform Hashing, Analysis Cryptographic Hashing Readings CLRS Chapter.4 (and.3.3 and.5 if interested) Open Addressing

### Chapter 13. Chapter Outline. Disk Storage, Basic File Structures, and Hashing

Chapter 13 Disk Storage, Basic File Structures, and Hashing Copyright 2007 Ramez Elmasri and Shamkant B. Navathe Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files

### Factoring & Primality

Factoring & Primality Lecturer: Dimitris Papadopoulos In this lecture we will discuss the problem of integer factorization and primality testing, two problems that have been the focus of a great amount

### Raima Database Manager Version 14.0 In-memory Database Engine

+ Raima Database Manager Version 14.0 In-memory Database Engine By Jeffrey R. Parsons, Senior Engineer January 2016 Abstract Raima Database Manager (RDM) v14.0 contains an all new data storage engine optimized

### Private Record Linkage with Bloom Filters

To appear in: Proceedings of Statistics Canada Symposium 2010 Social Statistics: The Interplay among Censuses, Surveys and Administrative Data Private Record Linkage with Bloom Filters Rainer Schnell,

### Chapter 13. Disk Storage, Basic File Structures, and Hashing

Chapter 13 Disk Storage, Basic File Structures, and Hashing Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files Hashed Files Dynamic and Extendible Hashing

### This material is built based on, Patterns covered in this class FILTERING PATTERNS. Filtering pattern

2/23/15 CS480 A2 Introduction to Big Data - Spring 2015 1 2/23/15 CS480 A2 Introduction to Big Data - Spring 2015 2 PART 0. INTRODUCTION TO BIG DATA PART 1. MAPREDUCE AND THE NEW SOFTWARE STACK 1. DISTRIBUTED

### Digital Signatures. (Note that authentication of sender is also achieved by MACs.) Scan your handwritten signature and append it to the document?

Cryptography Digital Signatures Professor: Marius Zimand Digital signatures are meant to realize authentication of the sender nonrepudiation (Note that authentication of sender is also achieved by MACs.)

### Speeding Up Cloud/Server Applications Using Flash Memory

Speeding Up Cloud/Server Applications Using Flash Memory Sudipta Sengupta Microsoft Research, Redmond, WA, USA Contains work that is joint with B. Debnath (Univ. of Minnesota) and J. Li (Microsoft Research,

### Lecture 2: Universality

CS 710: Complexity Theory 1/21/2010 Lecture 2: Universality Instructor: Dieter van Melkebeek Scribe: Tyson Williams In this lecture, we introduce the notion of a universal machine, develop efficient universal

### Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters

Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters Fan Deng University of Alberta fandeng@cs.ualberta.ca Davood Rafiei University of Alberta drafiei@cs.ualberta.ca ABSTRACT

### FAQs. This material is built based on. Lambda Architecture. Scaling with a queue. 8/27/2015 Sangmi Pallickara

CS535 Big Data - Fall 2015 W1.B.1 CS535 Big Data - Fall 2015 W1.B.2 CS535 BIG DATA FAQs Wait list Term project topics PART 0. INTRODUCTION 2. A PARADIGM FOR BIG DATA Sangmi Lee Pallickara Computer Science,

### Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman

Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman 2 In a DBMS, input is under the control of the programming staff. SQL INSERT commands or bulk loaders. Stream management is important

### MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

### FAWN - a Fast Array of Wimpy Nodes

University of Warsaw January 12, 2011 Outline Introduction 1 Introduction 2 3 4 5 Key issues Introduction Growing CPU vs. I/O gap Contemporary systems must serve millions of users Electricity consumed

### Estimating Deduplication Ratios in Large Data Sets

IBM Research labs - Haifa Estimating Deduplication Ratios in Large Data Sets Danny Harnik, Oded Margalit, Dalit Naor, Dmitry Sotnikov Gil Vernik Estimating dedupe and compression ratios some motivation

### Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

### Common Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein Falaki @mhfalaki hossein@databricks.com

Common Patterns and Pitfalls for Implementing Algorithms in Spark Hossein Falaki @mhfalaki hossein@databricks.com Challenges of numerical computation over big data When applying any algorithm to big data

### 198:211 Computer Architecture

198:211 Computer Architecture Topics: Lecture 8 (W5) Fall 2012 Data representation 2.1 and 2.2 of the book Floating point 2.4 of the book 1 Computer Architecture What do computers do? Manipulate stored

### Type of Submission: Article Title: DB2 s Integrated Support for Data Deduplication Devices Subtitle: Keywords: DB2, Backup, Deduplication

Type of Submission: Article Title: DB2 s Integrated Support for Data Deduplication Devices Subtitle: Keywords: DB2, Backup, Deduplication Prefix: Error! Bookmark not defined. Given: Dale Middle: M. Family:

### Arithmetic Coding: Introduction

Data Compression Arithmetic coding Arithmetic Coding: Introduction Allows using fractional parts of bits!! Used in PPM, JPEG/MPEG (as option), Bzip More time costly than Huffman, but integer implementation

### Hashing in Networked Systems

LB Server Cluster Switches Hashing in Networked Systems COS 461: Computer Networks Spring 2011 Mike Freedman h@p://www.cs.princeton.edu/courses/archive/spring11/cos461/ 2 Hash funcion Hashing FuncIon that

### This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution

### F-Secure Internet Security 2014 Data Transfer Declaration

F-Secure Internet Security 2014 Data Transfer Declaration The product s impact on privacy and bandwidth usage F-Secure Corporation April 15 th 2014 Table of Contents Version history... 3 Abstract... 3

### Breaking An Identity-Based Encryption Scheme based on DHIES

Breaking An Identity-Based Encryption Scheme based on DHIES Martin R. Albrecht 1 Kenneth G. Paterson 2 1 SALSA Project - INRIA, UPMC, Univ Paris 06 2 Information Security Group, Royal Holloway, University

### Quantum Computing Lecture 7. Quantum Factoring. Anuj Dawar

Quantum Computing Lecture 7 Quantum Factoring Anuj Dawar Quantum Factoring A polynomial time quantum algorithm for factoring numbers was published by Peter Shor in 1994. polynomial time here means that

### Large-Scale Test Mining

Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

### Jonathan Worthington Scarborough Linux User Group

Jonathan Worthington Scarborough Linux User Group Introduction What does a Virtual Machine do? Hides away the details of the hardware platform and operating system. Defines a common set of instructions.

### Big data coming soon... to an NSI near you. John Dunne. Central Statistics Office (CSO), Ireland John.Dunne@cso.ie

Big data coming soon... to an NSI near you John Dunne Central Statistics Office (CSO), Ireland John.Dunne@cso.ie Big data is beginning to be explored and exploited to inform policy making. However these

### Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

### 2. Compressing data to reduce the amount of transmitted data (e.g., to save money).

Presentation Layer The presentation layer is concerned with preserving the meaning of information sent across a network. The presentation layer may represent (encode) the data in various ways (e.g., data

Load-Balancing the Distance Computations in Record Linkage Dimitrios Karapiperis Vassilios S. Verykios Hellenic Open University School of Science and Technology Patras, Greece {dkarapiperis, verykios}@eap.gr

### Faster deterministic integer factorisation

David Harvey (joint work with Edgar Costa, NYU) University of New South Wales 25th October 2011 The obvious mathematical breakthrough would be the development of an easy way to factor large prime numbers

### THE NUMBER OF REPRESENTATIONS OF n OF THE FORM n = x 2 2 y, x > 0, y 0

THE NUMBER OF REPRESENTATIONS OF n OF THE FORM n = x 2 2 y, x > 0, y 0 RICHARD J. MATHAR Abstract. We count solutions to the Ramanujan-Nagell equation 2 y +n = x 2 for fixed positive n. The computational

### ProTrack: A Simple Provenance-tracking Filesystem

ProTrack: A Simple Provenance-tracking Filesystem Somak Das Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology das@mit.edu Abstract Provenance describes a file

### Scalable Prefix Matching for Internet Packet Forwarding

Scalable Prefix Matching for Internet Packet Forwarding Marcel Waldvogel Computer Engineering and Networks Laboratory Institut für Technische Informatik und Kommunikationsnetze Background Internet growth

### Chapter 11 I/O Management and Disk Scheduling

Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 11 I/O Management and Disk Scheduling Dave Bremer Otago Polytechnic, NZ 2008, Prentice Hall I/O Devices Roadmap Organization

### Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class

Problem 1. (10 Points) James 6.1 Problem 2. (10 Points) James 6.3 Problem 3. (10 Points) James 6.5 Problem 4. (15 Points) James 6.7 Problem 5. (15 Points) James 6.10 Homework 4 Statistics W4240: Data Mining

### Vulnerability Analysis of Hash Tables to Sophisticated DDoS Attacks

International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 12 (2014), pp. 1167-1173 International Research Publications House http://www. irphouse.com Vulnerability

### Data Link Layer(1) Principal service: Transferring data from the network layer of the source machine to the one of the destination machine

Data Link Layer(1) Principal service: Transferring data from the network layer of the source machine to the one of the destination machine Virtual communication versus actual communication: Specific functions

### Problem Solving With C++ Ninth Edition

CISC 1600/1610 Computer Science I Programming in C++ Professor Daniel Leeds dleeds@fordham.edu JMH 328A Introduction to programming with C++ Learn Fundamental programming concepts Key techniques Basic

### BOSTON UNIVERSITY GRADUATE SCHOOL OF ARTS AND SCIENCES. Thesis SCALABLE COORDINATION TECHNIQUES FOR DISTRIBUTED NETWORK MONITORING MANISH RAJ SHARMA

BOSTON UNIVERSITY GRADUATE SCHOOL OF ARTS AND SCIENCES Thesis SCALABLE COORDINATION TECHNIQUES FOR DISTRIBUTED NETWORK MONITORING by MANISH RAJ SHARMA B.E., Birla Institute of Technology, 1998 Submitted

### Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

### SQL Server Instance-Level Benchmarks with DVDStore

SQL Server Instance-Level Benchmarks with DVDStore Dell developed a synthetic benchmark tool back that can run benchmark tests against SQL Server, Oracle, MySQL, and PostgreSQL installations. It is open-sourced

### Machine Learning using MapReduce

Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

### Introduction to Programming (in C++) Loops. Jordi Cortadella, Ricard Gavaldà, Fernando Orejas Dept. of Computer Science, UPC

Introduction to Programming (in C++) Loops Jordi Cortadella, Ricard Gavaldà, Fernando Orejas Dept. of Computer Science, UPC Example Assume the following specification: Input: read a number N > 0 Output:

### Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

### Example. Introduction to Programming (in C++) Loops. The while statement. Write the numbers 1 N. Assume the following specification:

Example Introduction to Programming (in C++) Loops Assume the following specification: Input: read a number N > 0 Output: write the sequence 1 2 3 N (one number per line) Jordi Cortadella, Ricard Gavaldà,

### Why? A central concept in Computer Science. Algorithms are ubiquitous.

Analysis of Algorithms: A Brief Introduction Why? A central concept in Computer Science. Algorithms are ubiquitous. Using the Internet (sending email, transferring files, use of search engines, online