Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman

Size: px
Start display at page:

Download "Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman"

Transcription

1 Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman

2 To motivate the Bloom-filter idea, consider a web crawler. It keeps, centrally, a list of all the URL s it has found so far. It assigns these URL s to any of a number of parallel tasks; these tasks stream back the URL s they find in the links they discover on a page. It needs to filter out those URL s it has seen before. 2

3 3 A Bloom filter placed on the stream of URL s will declare that certain URL s have been seen before. Others will be declared new, and will be added to the list of URL s that need to be crawled. Unfortunately, the Bloom filter can have false positives. It can declare a URL has been seen before when it hasn t. But if it says never seen, then it is truly new.

4 A Bloom filter is an array of bits, together with a number of hash functions. The argument of each hash function is a stream element, and it returns a position in the array. Initially, all bits are 0. When input x arrives, we set to 1 the bits h(x), for each hash function h. 4

5 5 Use N = 11 bits for our filter. Stream elements = integers. Use two hash functions: h 1 (x) = Take odd-numbered bits from the right in the binary representation of x. Treat it as an integer i. Result is i modulo 11. h 2 (x) = same, but take even-numbered bits.

6 6 Stream element h 1 h 2 Filter contents = = =

7 Suppose element y appears in the stream, and we want to know if we have seen y before. Compute h(y) for each hash function y. If all the resulting bit positions are 1, say we have seen y before. If at least one of these positions is 0, say we have not seen y before. 7

8 Suppose we have the same Bloom filter as before, and we have set the filter to Lookup element y = 118 = (binary). h 1 (y) = 14 modulo 11 = 3. h 2 (y) = 5 modulo 11 = 5. Bit 5 is 1, but bit 3 is 0, so we are sure y is not in the set. 8

9 9 Probability of a false positive depends on the density of 1 s in the array and the number of hash functions. = (fraction of 1 s) # of hash functions. The number of 1 s is approximately the number of elements inserted times the number of hash functions. But collisions lower that number slightly.

10 Turning random bits from 0 to 1 is like throwing d darts at t targets, at random. How many targets are hit by at least one dart? Probability a given target is hit by a given dart = 1/t. Probability none of d darts hit a given target is (1-1/t) d. Rewrite as (1-1/t) t(d/t) ~= e -d/t. 10

11 Suppose we use an array of 1 billion bits, 5 hash functions, and we insert 100 million elements. That is, t = 10 9, and d = 5*10 8. The fraction of 0 s that remain will be e -1/2 = Density of 1 s = Probability of a false positive = (0.393) 5 =

12

13 13 Suppose Google would like to examine its stream of search queries for the past month to find out what fraction of them were unique asked only once. But to save time, we are only going to sample 1/10 th of the stream. The fraction of unique queries in the sample!= the fraction for the stream as a whole. In fact, we can t even adjust the sample s fraction to give the correct answer.

14 14 The length of the sample is 10% of the length of the whole stream. Suppose a query is unique. It has a 10% chance of being in the sample. Suppose a query occurs exactly twice in the stream. It has an 18% chance of appearing exactly once in the sample. And so on The fraction of unique queries in the stream is unpredictably large.

15 15 Our mistake: we sampled based on the position in the stream, rather than the value of the stream element. The right way: hash search queries to 10 buckets 0, 1,, 9. Sample = all search queries that hash to bucket 0. All or none of the instances of a query are selected. Therefore the fraction of unique queries in the sample is the same as for the stream as a whole.

16 Problem: What if the total sample size is limited? Solution: Hash to a large number of buckets. Adjust the set of buckets accepted for the sample, so your sample size stays within bounds. 16

17 17 Suppose we start our search-query sample at 10%, but we want to limit the size. Hash to, say, 100 buckets, 0, 1,, 99. Take for the sample those elements hashing to buckets 0 through 9. If the sample gets too big, get rid of bucket 9. Still too big, get rid of 8, and so on.

18 18 This technique generalizes to any form of data that we can see as tuples (K, V), where K is the key and V is a value. Distinction: We want our sample to be based on picking some set of keys only, not pairs. In the search-query example, the data was all key. Hash keys to some number of buckets. Sample consists of all key-value pairs with a key that goes into one of the selected buckets.

19 Data = tuples of the form (EmpID, Dept, Salary). Query: What is the average range of salaries within a department? Key = Dept. Value = (EmpID, Salary). Sample picks some departments, has salaries for all employees of that department, including its min and max salaries. 19

20

21 Problem: a data stream consists of elements chosen from a set of size n. Maintain a count of the number of distinct elements seen so far. Obvious approach: maintain the set of elements seen. 21

22 22 How many different words are found among the Web pages being crawled at a site? Unusually low or high numbers could indicate artificial pages (spam?). How many unique users visited Facebook this month? How many different pages link to each of the pages we have crawled. Useful for estimating the PageRank of these pages.

23 Real Problem: what if we do not have space to store the complete set? Estimate the count in an unbiased way. Accept that the count may be in error, but limit the probability that the error is large. 23

24 Pick a hash function h that maps each of the n elements to at least log 2 n bits. For each stream element a, let r(a) be the number of trailing 0 s in h(a). Record R = the maximum r(a) seen. Estimate = 2 R. 24

25 25 The probability that a given h(a) ends in at least i 0 s is 2 -i. If there are m different elements, the probability that R i is 1 (1-2 -i ) m. Prob. all h(a) s end in fewer than i 0 s. Prob. a given h(a) ends in fewer than i 0 s.

26 26 Since 2 -i is small, 1 - (1-2 -i ) m 1 - e -m2. If 2 i >> m, 1 - e -m2 -i 1 - (1 - m2 -i ) m/2 i 0. If 2 i << m, 1 - e -m2 -i 1. Thus, 2 R will almost always be around m. -i First 2 terms of the Taylor expansion of e x

27 27 E(2 R ) is, in principle, infinite. Probability halves when R -> R+1, but value doubles. Workaround involves using many hash functions and getting many samples. How are samples combined? Average? What if one very large value? Median? All values are a power of 2.

28 28 Partition your samples into small groups. O(log n), where n = size of universal set, suffices. Take the average within each group. Then take the median of the averages.

29 Suppose a stream has elements chosen from a set of n values. Let m i be the number of times value i occurs. The k th moment is the sum of (m i ) k over all i. 29

30 30 0 th moment = number of different elements in the stream. The problem just considered. 1 st moment = count of the numbers of elements = length of the stream. Easy to compute. 2 nd moment = surprise number = a measure of how uneven the distribution is.

31 Stream of length 100; 11 values appear. Unsurprising: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9. Surprise # = 910. Surprising: 90, 1, 1, 1, 1, 1, 1, 1,1, 1, 1. Surprise # = 8,

32 32 Works for all moments; gives an unbiased estimate. We ll just concentrate on 2 nd moment. Based on calculation of many random variables X. Each requires a count in main memory, so number is limited.

33 33 Assume stream has length n. Pick a random time to start, so that any time is equally likely. Let the chosen time have element a in the stream. X = n * ((twice the number of a s in the stream starting at the chosen time) 1). Note: store n once, count of a s for each X.

34 34 2 nd moment is Σ a (m a ) 2. E(X ) = (1/n)(Σ all times t n * (twice the number of times the stream element at time t appears from that time on) 1). = Σ a (1/n)(n)( m a -1). = Σ a (m a ) 2. Group times by the value seen Time when the last a is seen Time when penultimate a is seen Time when the first a is seen

35 We assumed there was a number n, the number of positions in the stream. But real streams go on forever, so n changes; it is the number of inputs seen so far. 35

36 36 1. The variables X have n as a factor keep n separately; just hold the count in X. 2. Suppose we can only store k counts. We cannot have one random variable X for each start-time, and must throw out some starttimes as we read the stream. Objective: each starting time t is selected with probability k/n.

37 37 Choose the first k times for k variables. When the n th element arrives (n > k), choose it with probability k/n. If you choose it, throw one of the previously stored variables out, with equal probability. Probability of each of the first n-1 positions being chosen: (n-k)/n * k/(n-1) + k/n * k/(n-1) * (k-1)/k = k/n n-th position not chosen Previously chosen n-th position chosen Previously chosen Survives

Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman

Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman 2 In a DBMS, input is under the control of the programming staff. SQL INSERT commands or bulk loaders. Stream management is important

More information

Mining Data Streams. Chapter 4. 4.1 The Stream Data Model

Mining Data Streams. Chapter 4. 4.1 The Stream Data Model Chapter 4 Mining Data Streams Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it. In this chapter, we shall make

More information

Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.

Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m. Universal hashing No matter how we choose our hash function, it is always possible to devise a set of keys that will hash to the same slot, making the hash scheme perform poorly. To circumvent this, we

More information

Chapter Objectives. Chapter 9. Sequential Search. Search Algorithms. Search Algorithms. Binary Search

Chapter Objectives. Chapter 9. Sequential Search. Search Algorithms. Search Algorithms. Binary Search Chapter Objectives Chapter 9 Search Algorithms Data Structures Using C++ 1 Learn the various search algorithms Explore how to implement the sequential and binary search algorithms Discover how the sequential

More information

Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms Big Data & Scripting Part II Streaming Algorithms 1, Counting Distinct Elements 2, 3, counting distinct elements problem formalization input: stream of elements o from some universe U e.g. ids from a set

More information

Entity Resolution Fingerprints Similar News Articles. Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman

Entity Resolution Fingerprints Similar News Articles. Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman Entity Resolution Fingerprints Similar News Articles Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman 2 The entity-resolution problem is to examine a collection of records and

More information

Binary Number System. 16. Binary Numbers. Base 10 digits: 0 1 2 3 4 5 6 7 8 9. Base 2 digits: 0 1

Binary Number System. 16. Binary Numbers. Base 10 digits: 0 1 2 3 4 5 6 7 8 9. Base 2 digits: 0 1 Binary Number System 1 Base 10 digits: 0 1 2 3 4 5 6 7 8 9 Base 2 digits: 0 1 Recall that in base 10, the digits of a number are just coefficients of powers of the base (10): 417 = 4 * 10 2 + 1 * 10 1

More information

Sorting revisited. Build the binary search tree: O(n^2) Traverse the binary tree: O(n) Total: O(n^2) + O(n) = O(n^2)

Sorting revisited. Build the binary search tree: O(n^2) Traverse the binary tree: O(n) Total: O(n^2) + O(n) = O(n^2) Sorting revisited How did we use a binary search tree to sort an array of elements? Tree Sort Algorithm Given: An array of elements to sort 1. Build a binary search tree out of the elements 2. Traverse

More information

Lecture 2 February 12, 2003

Lecture 2 February 12, 2003 6.897: Advanced Data Structures Spring 003 Prof. Erik Demaine Lecture February, 003 Scribe: Jeff Lindy Overview In the last lecture we considered the successor problem for a bounded universe of size u.

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Analysis of Binary Search algorithm and Selection Sort algorithm

Analysis of Binary Search algorithm and Selection Sort algorithm Analysis of Binary Search algorithm and Selection Sort algorithm In this section we shall take up two representative problems in computer science, work out the algorithms based on the best strategy to

More information

CS 2112 Spring 2014. 0 Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

CS 2112 Spring 2014. 0 Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions CS 2112 Spring 2014 Assignment 3 Data Structures and Web Filtering Due: March 4, 2014 11:59 PM Implementing spam blacklists and web filters requires matching candidate domain names and URLs very rapidly

More information

Multi-dimensional index structures Part I: motivation

Multi-dimensional index structures Part I: motivation Multi-dimensional index structures Part I: motivation 144 Motivation: Data Warehouse A definition A data warehouse is a repository of integrated enterprise data. A data warehouse is used specifically for

More information

Encoding Text with a Small Alphabet

Encoding Text with a Small Alphabet Chapter 2 Encoding Text with a Small Alphabet Given the nature of the Internet, we can break the process of understanding how information is transmitted into two components. First, we have to figure out

More information

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

More information

Tables so far. set() get() delete() BST Average O(lg n) O(lg n) O(lg n) Worst O(n) O(n) O(n) RB Tree Average O(lg n) O(lg n) O(lg n)

Tables so far. set() get() delete() BST Average O(lg n) O(lg n) O(lg n) Worst O(n) O(n) O(n) RB Tree Average O(lg n) O(lg n) O(lg n) Hash Tables Tables so far set() get() delete() BST Average O(lg n) O(lg n) O(lg n) Worst O(n) O(n) O(n) RB Tree Average O(lg n) O(lg n) O(lg n) Worst O(lg n) O(lg n) O(lg n) Table naïve array implementation

More information

The Exponential Distribution

The Exponential Distribution 21 The Exponential Distribution From Discrete-Time to Continuous-Time: In Chapter 6 of the text we will be considering Markov processes in continuous time. In a sense, we already have a very good understanding

More information

MapReduce and the New Software Stack

MapReduce and the New Software Stack 20 Chapter 2 MapReduce and the New Software Stack Modern data-mining applications, often called big-data analysis, require us to manage immense amounts of data quickly. In many of these applications, the

More information

The Advantages and Disadvantages of Network Computing Nodes

The Advantages and Disadvantages of Network Computing Nodes Big Data & Scripting storage networks and distributed file systems 1, 2, in the remainder we use networks of computing nodes to enable computations on even larger datasets for a computation, each node

More information

Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),

More information

2.1. Inductive Reasoning EXAMPLE A

2.1. Inductive Reasoning EXAMPLE A CONDENSED LESSON 2.1 Inductive Reasoning In this lesson you will Learn how inductive reasoning is used in science and mathematics Use inductive reasoning to make conjectures about sequences of numbers

More information

Cuckoo Filter: Practically Better Than Bloom

Cuckoo Filter: Practically Better Than Bloom Cuckoo Filter: Practically Better Than Bloom Bin Fan, David G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher Carnegie Mellon University, Intel Labs, Harvard University {binfan,dga}@cs.cmu.edu, michael.e.kaminsky@intel.com,

More information

Cassandra A Decentralized, Structured Storage System

Cassandra A Decentralized, Structured Storage System Cassandra A Decentralized, Structured Storage System Avinash Lakshman and Prashant Malik Facebook Published: April 2010, Volume 44, Issue 2 Communications of the ACM http://dl.acm.org/citation.cfm?id=1773922

More information

Chapter 3. if 2 a i then location: = i. Page 40

Chapter 3. if 2 a i then location: = i. Page 40 Chapter 3 1. Describe an algorithm that takes a list of n integers a 1,a 2,,a n and finds the number of integers each greater than five in the list. Ans: procedure greaterthanfive(a 1,,a n : integers)

More information

Discrete Math in Computer Science Homework 7 Solutions (Max Points: 80)

Discrete Math in Computer Science Homework 7 Solutions (Max Points: 80) Discrete Math in Computer Science Homework 7 Solutions (Max Points: 80) CS 30, Winter 2016 by Prasad Jayanti 1. (10 points) Here is the famous Monty Hall Puzzle. Suppose you are on a game show, and you

More information

Lecture 1: Course overview, circuits, and formulas

Lecture 1: Course overview, circuits, and formulas Lecture 1: Course overview, circuits, and formulas Topics in Complexity Theory and Pseudorandomness (Spring 2013) Rutgers University Swastik Kopparty Scribes: John Kim, Ben Lund 1 Course Information Swastik

More information

MODELING RANDOMNESS IN NETWORK TRAFFIC

MODELING RANDOMNESS IN NETWORK TRAFFIC MODELING RANDOMNESS IN NETWORK TRAFFIC - LAVANYA JOSE, INDEPENDENT WORK FALL 11 ADVISED BY PROF. MOSES CHARIKAR ABSTRACT. Sketches are randomized data structures that allow one to record properties of

More information

Question 1 Formatted: Formatted: Formatted: Formatted:

Question 1 Formatted: Formatted: Formatted: Formatted: In many situations in life, we are presented with opportunities to evaluate probabilities of events occurring and make judgments and decisions from this information. In this paper, we will explore four

More information

Binomial random variables

Binomial random variables Binomial and Poisson Random Variables Solutions STAT-UB.0103 Statistics for Business Control and Regression Models Binomial random variables 1. A certain coin has a 5% of landing heads, and a 75% chance

More information

Why? A central concept in Computer Science. Algorithms are ubiquitous.

Why? A central concept in Computer Science. Algorithms are ubiquitous. Analysis of Algorithms: A Brief Introduction Why? A central concept in Computer Science. Algorithms are ubiquitous. Using the Internet (sending email, transferring files, use of search engines, online

More information

Lectures 5-6: Taylor Series

Lectures 5-6: Taylor Series Math 1d Instructor: Padraic Bartlett Lectures 5-: Taylor Series Weeks 5- Caltech 213 1 Taylor Polynomials and Series As we saw in week 4, power series are remarkably nice objects to work with. In particular,

More information

Lecture 6 Online and streaming algorithms for clustering

Lecture 6 Online and streaming algorithms for clustering CSE 291: Unsupervised learning Spring 2008 Lecture 6 Online and streaming algorithms for clustering 6.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line

More information

CIS 631 Database Management Systems Sample Final Exam

CIS 631 Database Management Systems Sample Final Exam CIS 631 Database Management Systems Sample Final Exam 1. (25 points) Match the items from the left column with those in the right and place the letters in the empty slots. k 1. Single-level index files

More information

Lecture 4 Online and streaming algorithms for clustering

Lecture 4 Online and streaming algorithms for clustering CSE 291: Geometric algorithms Spring 2013 Lecture 4 Online and streaming algorithms for clustering 4.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line

More information

Load Balancing in MapReduce Based on Scalable Cardinality Estimates

Load Balancing in MapReduce Based on Scalable Cardinality Estimates Load Balancing in MapReduce Based on Scalable Cardinality Estimates Benjamin Gufler 1, Nikolaus Augsten #, Angelika Reiser 3, Alfons Kemper 4 Technische Universität München Boltzmannstraße 3, 85748 Garching

More information

Efficiency of algorithms. Algorithms. Efficiency of algorithms. Binary search and linear search. Best, worst and average case.

Efficiency of algorithms. Algorithms. Efficiency of algorithms. Binary search and linear search. Best, worst and average case. Algorithms Efficiency of algorithms Computational resources: time and space Best, worst and average case performance How to compare algorithms: machine-independent measure of efficiency Growth rate Complexity

More information

Counting Problems in Flash Storage Design

Counting Problems in Flash Storage Design Flash Talk Counting Problems in Flash Storage Design Bongki Moon Department of Computer Science University of Arizona Tucson, AZ 85721, U.S.A. bkmoon@cs.arizona.edu NVRAMOS 09, Jeju, Korea, October 2009-1-

More information

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman 2 Intuition: solve the recursive equation: a page is important if important pages link to it. Technically, importance = the principal

More information

Chapter 2 Data Storage

Chapter 2 Data Storage Chapter 2 22 CHAPTER 2. DATA STORAGE 2.1. THE MEMORY HIERARCHY 23 26 CHAPTER 2. DATA STORAGE main memory, yet is essentially random-access, with relatively small differences Figure 2.4: A typical

More information

Hypertable Architecture Overview

Hypertable Architecture Overview WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for

More information

B490 Mining the Big Data. 0 Introduction

B490 Mining the Big Data. 0 Introduction B490 Mining the Big Data 0 Introduction Qin Zhang 1-1 Data Mining What is Data Mining? A definition : Discovery of useful, possibly unexpected, patterns in data. 2-1 Data Mining What is Data Mining? A

More information

Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time

Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time Edward Bortnikov & Ronny Lempel Yahoo! Labs, Haifa Class Outline Link-based page importance measures Why

More information

Practical Survey on Hash Tables. Aurelian Țuțuianu

Practical Survey on Hash Tables. Aurelian Țuțuianu Practical Survey on Hash Tables Aurelian Țuțuianu In memoriam Mihai Pătraşcu (17 July 1982 5 June 2012) I have no intention to ever teach computer science. I want to teach the love for computer science,

More information

Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff

Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff Nimble Algorithms for Cloud Computing Ravi Kannan, Santosh Vempala and David Woodruff Cloud computing Data is distributed arbitrarily on many servers Parallel algorithms: time Streaming algorithms: sublinear

More information

Chapter 13: Query Processing. Basic Steps in Query Processing

Chapter 13: Query Processing. Basic Steps in Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

Chapter 20: Data Analysis

Chapter 20: Data Analysis Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

More information

Algorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)

Algorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven) Algorithmic Aspects of Big Data Nikhil Bansal (TU Eindhoven) Algorithm design Algorithm: Set of steps to solve a problem (by a computer) Studied since 1950 s. Given a problem: Find (i) best solution (ii)

More information

Theorem (informal statement): There are no extendible methods in David Chalmers s sense unless P = NP.

Theorem (informal statement): There are no extendible methods in David Chalmers s sense unless P = NP. Theorem (informal statement): There are no extendible methods in David Chalmers s sense unless P = NP. Explication: In his paper, The Singularity: A philosophical analysis, David Chalmers defines an extendible

More information

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02) Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we

More information

Data Structures for Big Data: Bloom Filter. Vinicius Vielmo Cogo Smalltalks, DI, FC/UL. October 16, 2014.

Data Structures for Big Data: Bloom Filter. Vinicius Vielmo Cogo Smalltalks, DI, FC/UL. October 16, 2014. Data Structures for Big Data: Bloom Filter Vinicius Vielmo Cogo Smalltalks, DI, FC/UL. October 16, 2014. is relative is not defined by a specific number of TB, PB, EB is when it becomes big for you is

More information

Random Fibonacci-type Sequences in Online Gambling

Random Fibonacci-type Sequences in Online Gambling Random Fibonacci-type Sequences in Online Gambling Adam Biello, CJ Cacciatore, Logan Thomas Department of Mathematics CSUMS Advisor: Alfa Heryudono Department of Mathematics University of Massachusetts

More information

An example of a computable

An example of a computable An example of a computable absolutely normal number Verónica Becher Santiago Figueira Abstract The first example of an absolutely normal number was given by Sierpinski in 96, twenty years before the concept

More information

6.042/18.062J Mathematics for Computer Science December 12, 2006 Tom Leighton and Ronitt Rubinfeld. Random Walks

6.042/18.062J Mathematics for Computer Science December 12, 2006 Tom Leighton and Ronitt Rubinfeld. Random Walks 6.042/8.062J Mathematics for Comuter Science December 2, 2006 Tom Leighton and Ronitt Rubinfeld Lecture Notes Random Walks Gambler s Ruin Today we re going to talk about one-dimensional random walks. In

More information

DNS LOOKUP SYSTEM DATA STRUCTURES AND ALGORITHMS PROJECT REPORT

DNS LOOKUP SYSTEM DATA STRUCTURES AND ALGORITHMS PROJECT REPORT DNS LOOKUP SYSTEM DATA STRUCTURES AND ALGORITHMS PROJECT REPORT By GROUP Avadhut Gurjar Mohsin Patel Shraddha Pandhe Page 1 Contents 1. Introduction... 3 2. DNS Recursive Query Mechanism:...5 2.1. Client

More information

Find-The-Number. 1 Find-The-Number With Comps

Find-The-Number. 1 Find-The-Number With Comps Find-The-Number 1 Find-The-Number With Comps Consider the following two-person game, which we call Find-The-Number with Comps. Player A (for answerer) has a number x between 1 and 1000. Player Q (for questioner)

More information

Approximating functions by Taylor Polynomials.

Approximating functions by Taylor Polynomials. Chapter 4 Approximating functions by Taylor Polynomials. 4.1 Linear Approximations We have already seen how to approximate a function using its tangent line. This was the key idea in Euler s method. If

More information

Database Security. Soon M. Chung Department of Computer Science and Engineering Wright State University schung@cs.wright.

Database Security. Soon M. Chung Department of Computer Science and Engineering Wright State University schung@cs.wright. Database Security Soon M. Chung Department of Computer Science and Engineering Wright State University schung@cs.wright.edu 937-775-5119 Goals of DB Security Integrity: Only authorized users should be

More information

FAWN - a Fast Array of Wimpy Nodes

FAWN - a Fast Array of Wimpy Nodes University of Warsaw January 12, 2011 Outline Introduction 1 Introduction 2 3 4 5 Key issues Introduction Growing CPU vs. I/O gap Contemporary systems must serve millions of users Electricity consumed

More information

Rethinking SIMD Vectorization for In-Memory Databases

Rethinking SIMD Vectorization for In-Memory Databases SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest

More information

OSTROWSKI FOR NUMBER FIELDS

OSTROWSKI FOR NUMBER FIELDS OSTROWSKI FOR NUMBER FIELDS KEITH CONRAD Ostrowski classified the nontrivial absolute values on Q: up to equivalence, they are the usual (archimedean) absolute value and the p-adic absolute values for

More information

IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS

IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS Volume 2, No. 3, March 2011 Journal of Global Research in Computer Science RESEARCH PAPER Available Online at www.jgrcs.info IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE

More information

SMALL INDEX LARGE INDEX (SILT)

SMALL INDEX LARGE INDEX (SILT) Wayne State University ECE 7650: Scalable and Secure Internet Services and Architecture SMALL INDEX LARGE INDEX (SILT) A Memory Efficient High Performance Key Value Store QA REPORT Instructor: Dr. Song

More information

NETWORK SECURITY: How do servers store passwords?

NETWORK SECURITY: How do servers store passwords? NETWORK SECURITY: How do servers store passwords? Servers avoid storing the passwords in plaintext on their servers to avoid possible intruders to gain all their users passwords. A hash of each password

More information

Who am I? Copyright 2014, Oracle and/or its affiliates. All rights reserved. 3

Who am I? Copyright 2014, Oracle and/or its affiliates. All rights reserved. 3 Oracle Database In-Memory Power the Real-Time Enterprise Saurabh K. Gupta Principal Technologist, Database Product Management Who am I? Principal Technologist, Database Product Management at Oracle Author

More information

MapReduce: Algorithm Design Patterns

MapReduce: Algorithm Design Patterns Designing Algorithms for MapReduce MapReduce: Algorithm Design Patterns Need to adapt to a restricted model of computation Goals Scalability: adding machines will make the algo run faster Efficiency: resources

More information

KNOM Tutorial 2003. Internet Traffic Measurement and Analysis. Sue Bok Moon Dept. of Computer Science

KNOM Tutorial 2003. Internet Traffic Measurement and Analysis. Sue Bok Moon Dept. of Computer Science KNOM Tutorial 2003 Internet Traffic Measurement and Analysis Sue Bok Moon Dept. of Computer Science Overview Definition of Traffic Matrix 4Traffic demand, delay, loss Applications of Traffic Matrix 4Engineering,

More information

Ex. 2.1 (Davide Basilio Bartolini)

Ex. 2.1 (Davide Basilio Bartolini) ECE 54: Elements of Information Theory, Fall 00 Homework Solutions Ex.. (Davide Basilio Bartolini) Text Coin Flips. A fair coin is flipped until the first head occurs. Let X denote the number of flips

More information

Bloom Filters. Christian Antognini Trivadis AG Zürich, Switzerland

Bloom Filters. Christian Antognini Trivadis AG Zürich, Switzerland Bloom Filters Christian Antognini Trivadis AG Zürich, Switzerland Oracle Database uses bloom filters in various situations. Unfortunately, no information about their usage is available in Oracle documentation.

More information

Point Biserial Correlation Tests

Point Biserial Correlation Tests Chapter 807 Point Biserial Correlation Tests Introduction The point biserial correlation coefficient (ρ in this chapter) is the product-moment correlation calculated between a continuous random variable

More information

Big Data Analytics. Genoveva Vargas-Solar http://www.vargas-solar.com/big-data-analytics French Council of Scientific Research, LIG & LAFMIA Labs

Big Data Analytics. Genoveva Vargas-Solar http://www.vargas-solar.com/big-data-analytics French Council of Scientific Research, LIG & LAFMIA Labs 1 Big Data Analytics Genoveva Vargas-Solar http://www.vargas-solar.com/big-data-analytics French Council of Scientific Research, LIG & LAFMIA Labs Montevideo, 22 nd November 4 th December, 2015 INFORMATIQUE

More information

Load Balancing Between Heterogenous Computing Clusters

Load Balancing Between Heterogenous Computing Clusters Load Balancing Between Heterogenous Computing Clusters Siu-Cheung Chau Dept. of Physics and Computing, Wilfrid Laurier University, Waterloo, Ontario, Canada, N2L 3C5 e-mail: schau@wlu.ca Ada Wai-Chee Fu

More information

Assignment 2: More MapReduce with Hadoop

Assignment 2: More MapReduce with Hadoop Assignment 2: More MapReduce with Hadoop Jean-Pierre Lozi February 5, 2015 Provided files following URL: An archive that contains all files you will need for this assignment can be found at the http://sfu.ca/~jlozi/cmpt732/assignment2.tar.gz

More information

Binary Heaps * * * * * * * / / \ / \ / \ / \ / \ * * * * * * * * * * * / / \ / \ / / \ / \ * * * * * * * * * *

Binary Heaps * * * * * * * / / \ / \ / \ / \ / \ * * * * * * * * * * * / / \ / \ / / \ / \ * * * * * * * * * * Binary Heaps A binary heap is another data structure. It implements a priority queue. Priority Queue has the following operations: isempty add (with priority) remove (highest priority) peek (at highest

More information

Math 55: Discrete Mathematics

Math 55: Discrete Mathematics Math 55: Discrete Mathematics UC Berkeley, Fall 2011 Homework # 5, due Wednesday, February 22 5.1.4 Let P (n) be the statement that 1 3 + 2 3 + + n 3 = (n(n + 1)/2) 2 for the positive integer n. a) What

More information

Statistics 104: Section 6!

Statistics 104: Section 6! Page 1 Statistics 104: Section 6! TF: Deirdre (say: Dear-dra) Bloome Email: dbloome@fas.harvard.edu Section Times Thursday 2pm-3pm in SC 109, Thursday 5pm-6pm in SC 705 Office Hours: Thursday 6pm-7pm SC

More information

Kafka & Redis for Big Data Solutions

Kafka & Redis for Big Data Solutions Kafka & Redis for Big Data Solutions Christopher Curtin Head of Technical Research @ChrisCurtin About Me 25+ years in technology Head of Technical Research at Silverpop, an IBM Company (14 + years at Silverpop)

More information

Load Balancing. Load Balancing 1 / 24

Load Balancing. Load Balancing 1 / 24 Load Balancing Backtracking, branch & bound and alpha-beta pruning: how to assign work to idle processes without much communication? Additionally for alpha-beta pruning: implementing the young-brothers-wait

More information

Math/Stats 342: Solutions to Homework

Math/Stats 342: Solutions to Homework Math/Stats 342: Solutions to Homework Steven Miller (sjm1@williams.edu) November 17, 2011 Abstract Below are solutions / sketches of solutions to the homework problems from Math/Stats 342: Probability

More information

Practice Problems for Homework #6. Normal distribution and Central Limit Theorem.

Practice Problems for Homework #6. Normal distribution and Central Limit Theorem. Practice Problems for Homework #6. Normal distribution and Central Limit Theorem. 1. Read Section 3.4.6 about the Normal distribution and Section 4.7 about the Central Limit Theorem. 2. Solve the practice

More information

Relational Database: Additional Operations on Relations; SQL

Relational Database: Additional Operations on Relations; SQL Relational Database: Additional Operations on Relations; SQL Greg Plaxton Theory in Programming Practice, Fall 2005 Department of Computer Science University of Texas at Austin Overview The course packet

More information

Common Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein Falaki @mhfalaki hossein@databricks.com

Common Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein Falaki @mhfalaki hossein@databricks.com Common Patterns and Pitfalls for Implementing Algorithms in Spark Hossein Falaki @mhfalaki hossein@databricks.com Challenges of numerical computation over big data When applying any algorithm to big data

More information

Factoring & Primality

Factoring & Primality Factoring & Primality Lecturer: Dimitris Papadopoulos In this lecture we will discuss the problem of integer factorization and primality testing, two problems that have been the focus of a great amount

More information

Interactive Machine Learning. Maria-Florina Balcan

Interactive Machine Learning. Maria-Florina Balcan Interactive Machine Learning Maria-Florina Balcan Machine Learning Image Classification Document Categorization Speech Recognition Protein Classification Branch Prediction Fraud Detection Spam Detection

More information

(Refer Slide Time: 01.26)

(Refer Slide Time: 01.26) Discrete Mathematical Structures Dr. Kamala Krithivasan Department of Computer Science and Engineering Indian Institute of Technology, Madras Lecture # 27 Pigeonhole Principle In the next few lectures

More information

Lecture 1: Data Storage & Index

Lecture 1: Data Storage & Index Lecture 1: Data Storage & Index R&G Chapter 8-11 Concurrency control Query Execution and Optimization Relational Operators File & Access Methods Buffer Management Disk Space Management Recovery Manager

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information

Week 1: Introduction to Online Learning

Week 1: Introduction to Online Learning Week 1: Introduction to Online Learning 1 Introduction This is written based on Prediction, Learning, and Games (ISBN: 2184189 / -21-8418-9 Cesa-Bianchi, Nicolo; Lugosi, Gabor 1.1 A Gentle Start Consider

More information

BIG DATA, MAPREDUCE & HADOOP

BIG DATA, MAPREDUCE & HADOOP BIG, MAPREDUCE & HADOOP LARGE SCALE DISTRIBUTED SYSTEMS By Jean-Pierre Lozi A tutorial for the LSDS class LARGE SCALE DISTRIBUTED SYSTEMS BIG, MAPREDUCE & HADOOP 1 OBJECTIVES OF THIS LAB SESSION The LSDS

More information

Physical Data Organization

Physical Data Organization Physical Data Organization Database design using logical model of the database - appropriate level for users to focus on - user independence from implementation details Performance - other major factor

More information

IP Subnetting: Practical Subnet Design and Address Determination Example

IP Subnetting: Practical Subnet Design and Address Determination Example IP Subnetting: Practical Subnet Design and Address Determination Example When educators ask students what they consider to be the most confusing aspect in learning about networking, many say that it is

More information

arxiv:1112.0829v1 [math.pr] 5 Dec 2011

arxiv:1112.0829v1 [math.pr] 5 Dec 2011 How Not to Win a Million Dollars: A Counterexample to a Conjecture of L. Breiman Thomas P. Hayes arxiv:1112.0829v1 [math.pr] 5 Dec 2011 Abstract Consider a gambling game in which we are allowed to repeatedly

More information

Binomial random variables (Review)

Binomial random variables (Review) Poisson / Empirical Rule Approximations / Hypergeometric Solutions STAT-UB.3 Statistics for Business Control and Regression Models Binomial random variables (Review. Suppose that you are rolling a die

More information

Session 7 Fractions and Decimals

Session 7 Fractions and Decimals Key Terms in This Session Session 7 Fractions and Decimals Previously Introduced prime number rational numbers New in This Session period repeating decimal terminating decimal Introduction In this session,

More information

Numbers 101: Cost and Value Over Time

Numbers 101: Cost and Value Over Time The Anderson School at UCLA POL 2000-09 Numbers 101: Cost and Value Over Time Copyright 2000 by Richard P. Rumelt. We use the tool called discounting to compare money amounts received or paid at different

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop

More information

Bounded Cost Algorithms for Multivalued Consensus Using Binary Consensus Instances

Bounded Cost Algorithms for Multivalued Consensus Using Binary Consensus Instances Bounded Cost Algorithms for Multivalued Consensus Using Binary Consensus Instances Jialin Zhang Tsinghua University zhanggl02@mails.tsinghua.edu.cn Wei Chen Microsoft Research Asia weic@microsoft.com Abstract

More information

Stanford Math Circle: Sunday, May 9, 2010 Square-Triangular Numbers, Pell s Equation, and Continued Fractions

Stanford Math Circle: Sunday, May 9, 2010 Square-Triangular Numbers, Pell s Equation, and Continued Fractions Stanford Math Circle: Sunday, May 9, 00 Square-Triangular Numbers, Pell s Equation, and Continued Fractions Recall that triangular numbers are numbers of the form T m = numbers that can be arranged in

More information

Lecture 3: Finding integer solutions to systems of linear equations

Lecture 3: Finding integer solutions to systems of linear equations Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture

More information

1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++

1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++ Answer the following 1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++ 2) Which data structure is needed to convert infix notations to postfix notations? Stack 3) The

More information

Data Structures. Algorithm Performance and Big O Analysis

Data Structures. Algorithm Performance and Big O Analysis Data Structures Algorithm Performance and Big O Analysis What s an Algorithm? a clearly specified set of instructions to be followed to solve a problem. In essence: A computer program. In detail: Defined

More information