Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman
|
|
- Silas Simmons
- 8 years ago
- Views:
Transcription
1 Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman
2 To motivate the Bloom-filter idea, consider a web crawler. It keeps, centrally, a list of all the URL s it has found so far. It assigns these URL s to any of a number of parallel tasks; these tasks stream back the URL s they find in the links they discover on a page. It needs to filter out those URL s it has seen before. 2
3 3 A Bloom filter placed on the stream of URL s will declare that certain URL s have been seen before. Others will be declared new, and will be added to the list of URL s that need to be crawled. Unfortunately, the Bloom filter can have false positives. It can declare a URL has been seen before when it hasn t. But if it says never seen, then it is truly new.
4 A Bloom filter is an array of bits, together with a number of hash functions. The argument of each hash function is a stream element, and it returns a position in the array. Initially, all bits are 0. When input x arrives, we set to 1 the bits h(x), for each hash function h. 4
5 5 Use N = 11 bits for our filter. Stream elements = integers. Use two hash functions: h 1 (x) = Take odd-numbered bits from the right in the binary representation of x. Treat it as an integer i. Result is i modulo 11. h 2 (x) = same, but take even-numbered bits.
6 6 Stream element h 1 h 2 Filter contents = = =
7 Suppose element y appears in the stream, and we want to know if we have seen y before. Compute h(y) for each hash function y. If all the resulting bit positions are 1, say we have seen y before. If at least one of these positions is 0, say we have not seen y before. 7
8 Suppose we have the same Bloom filter as before, and we have set the filter to Lookup element y = 118 = (binary). h 1 (y) = 14 modulo 11 = 3. h 2 (y) = 5 modulo 11 = 5. Bit 5 is 1, but bit 3 is 0, so we are sure y is not in the set. 8
9 9 Probability of a false positive depends on the density of 1 s in the array and the number of hash functions. = (fraction of 1 s) # of hash functions. The number of 1 s is approximately the number of elements inserted times the number of hash functions. But collisions lower that number slightly.
10 Turning random bits from 0 to 1 is like throwing d darts at t targets, at random. How many targets are hit by at least one dart? Probability a given target is hit by a given dart = 1/t. Probability none of d darts hit a given target is (1-1/t) d. Rewrite as (1-1/t) t(d/t) ~= e -d/t. 10
11 Suppose we use an array of 1 billion bits, 5 hash functions, and we insert 100 million elements. That is, t = 10 9, and d = 5*10 8. The fraction of 0 s that remain will be e -1/2 = Density of 1 s = Probability of a false positive = (0.393) 5 =
12
13 13 Suppose Google would like to examine its stream of search queries for the past month to find out what fraction of them were unique asked only once. But to save time, we are only going to sample 1/10 th of the stream. The fraction of unique queries in the sample!= the fraction for the stream as a whole. In fact, we can t even adjust the sample s fraction to give the correct answer.
14 14 The length of the sample is 10% of the length of the whole stream. Suppose a query is unique. It has a 10% chance of being in the sample. Suppose a query occurs exactly twice in the stream. It has an 18% chance of appearing exactly once in the sample. And so on The fraction of unique queries in the stream is unpredictably large.
15 15 Our mistake: we sampled based on the position in the stream, rather than the value of the stream element. The right way: hash search queries to 10 buckets 0, 1,, 9. Sample = all search queries that hash to bucket 0. All or none of the instances of a query are selected. Therefore the fraction of unique queries in the sample is the same as for the stream as a whole.
16 Problem: What if the total sample size is limited? Solution: Hash to a large number of buckets. Adjust the set of buckets accepted for the sample, so your sample size stays within bounds. 16
17 17 Suppose we start our search-query sample at 10%, but we want to limit the size. Hash to, say, 100 buckets, 0, 1,, 99. Take for the sample those elements hashing to buckets 0 through 9. If the sample gets too big, get rid of bucket 9. Still too big, get rid of 8, and so on.
18 18 This technique generalizes to any form of data that we can see as tuples (K, V), where K is the key and V is a value. Distinction: We want our sample to be based on picking some set of keys only, not pairs. In the search-query example, the data was all key. Hash keys to some number of buckets. Sample consists of all key-value pairs with a key that goes into one of the selected buckets.
19 Data = tuples of the form (EmpID, Dept, Salary). Query: What is the average range of salaries within a department? Key = Dept. Value = (EmpID, Salary). Sample picks some departments, has salaries for all employees of that department, including its min and max salaries. 19
20
21 Problem: a data stream consists of elements chosen from a set of size n. Maintain a count of the number of distinct elements seen so far. Obvious approach: maintain the set of elements seen. 21
22 22 How many different words are found among the Web pages being crawled at a site? Unusually low or high numbers could indicate artificial pages (spam?). How many unique users visited Facebook this month? How many different pages link to each of the pages we have crawled. Useful for estimating the PageRank of these pages.
23 Real Problem: what if we do not have space to store the complete set? Estimate the count in an unbiased way. Accept that the count may be in error, but limit the probability that the error is large. 23
24 Pick a hash function h that maps each of the n elements to at least log 2 n bits. For each stream element a, let r(a) be the number of trailing 0 s in h(a). Record R = the maximum r(a) seen. Estimate = 2 R. 24
25 25 The probability that a given h(a) ends in at least i 0 s is 2 -i. If there are m different elements, the probability that R i is 1 (1-2 -i ) m. Prob. all h(a) s end in fewer than i 0 s. Prob. a given h(a) ends in fewer than i 0 s.
26 26 Since 2 -i is small, 1 - (1-2 -i ) m 1 - e -m2. If 2 i >> m, 1 - e -m2 -i 1 - (1 - m2 -i ) m/2 i 0. If 2 i << m, 1 - e -m2 -i 1. Thus, 2 R will almost always be around m. -i First 2 terms of the Taylor expansion of e x
27 27 E(2 R ) is, in principle, infinite. Probability halves when R -> R+1, but value doubles. Workaround involves using many hash functions and getting many samples. How are samples combined? Average? What if one very large value? Median? All values are a power of 2.
28 28 Partition your samples into small groups. O(log n), where n = size of universal set, suffices. Take the average within each group. Then take the median of the averages.
29 Suppose a stream has elements chosen from a set of n values. Let m i be the number of times value i occurs. The k th moment is the sum of (m i ) k over all i. 29
30 30 0 th moment = number of different elements in the stream. The problem just considered. 1 st moment = count of the numbers of elements = length of the stream. Easy to compute. 2 nd moment = surprise number = a measure of how uneven the distribution is.
31 Stream of length 100; 11 values appear. Unsurprising: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9. Surprise # = 910. Surprising: 90, 1, 1, 1, 1, 1, 1, 1,1, 1, 1. Surprise # = 8,
32 32 Works for all moments; gives an unbiased estimate. We ll just concentrate on 2 nd moment. Based on calculation of many random variables X. Each requires a count in main memory, so number is limited.
33 33 Assume stream has length n. Pick a random time to start, so that any time is equally likely. Let the chosen time have element a in the stream. X = n * ((twice the number of a s in the stream starting at the chosen time) 1). Note: store n once, count of a s for each X.
34 34 2 nd moment is Σ a (m a ) 2. E(X ) = (1/n)(Σ all times t n * (twice the number of times the stream element at time t appears from that time on) 1). = Σ a (1/n)(n)( m a -1). = Σ a (m a ) 2. Group times by the value seen Time when the last a is seen Time when penultimate a is seen Time when the first a is seen
35 We assumed there was a number n, the number of positions in the stream. But real streams go on forever, so n changes; it is the number of inputs seen so far. 35
36 36 1. The variables X have n as a factor keep n separately; just hold the count in X. 2. Suppose we can only store k counts. We cannot have one random variable X for each start-time, and must throw out some starttimes as we read the stream. Objective: each starting time t is selected with probability k/n.
37 37 Choose the first k times for k variables. When the n th element arrives (n > k), choose it with probability k/n. If you choose it, throw one of the previously stored variables out, with equal probability. Probability of each of the first n-1 positions being chosen: (n-k)/n * k/(n-1) + k/n * k/(n-1) * (k-1)/k = k/n n-th position not chosen Previously chosen n-th position chosen Previously chosen Survives
Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman
Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman 2 In a DBMS, input is under the control of the programming staff. SQL INSERT commands or bulk loaders. Stream management is important
More informationMining Data Streams. Chapter 4. 4.1 The Stream Data Model
Chapter 4 Mining Data Streams Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it. In this chapter, we shall make
More informationUniversal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.
Universal hashing No matter how we choose our hash function, it is always possible to devise a set of keys that will hash to the same slot, making the hash scheme perform poorly. To circumvent this, we
More informationChapter Objectives. Chapter 9. Sequential Search. Search Algorithms. Search Algorithms. Binary Search
Chapter Objectives Chapter 9 Search Algorithms Data Structures Using C++ 1 Learn the various search algorithms Explore how to implement the sequential and binary search algorithms Discover how the sequential
More informationBig Data & Scripting Part II Streaming Algorithms
Big Data & Scripting Part II Streaming Algorithms 1, Counting Distinct Elements 2, 3, counting distinct elements problem formalization input: stream of elements o from some universe U e.g. ids from a set
More informationEntity Resolution Fingerprints Similar News Articles. Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman
Entity Resolution Fingerprints Similar News Articles Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman 2 The entity-resolution problem is to examine a collection of records and
More informationBinary Number System. 16. Binary Numbers. Base 10 digits: 0 1 2 3 4 5 6 7 8 9. Base 2 digits: 0 1
Binary Number System 1 Base 10 digits: 0 1 2 3 4 5 6 7 8 9 Base 2 digits: 0 1 Recall that in base 10, the digits of a number are just coefficients of powers of the base (10): 417 = 4 * 10 2 + 1 * 10 1
More informationSorting revisited. Build the binary search tree: O(n^2) Traverse the binary tree: O(n) Total: O(n^2) + O(n) = O(n^2)
Sorting revisited How did we use a binary search tree to sort an array of elements? Tree Sort Algorithm Given: An array of elements to sort 1. Build a binary search tree out of the elements 2. Traverse
More informationLecture 2 February 12, 2003
6.897: Advanced Data Structures Spring 003 Prof. Erik Demaine Lecture February, 003 Scribe: Jeff Lindy Overview In the last lecture we considered the successor problem for a bounded universe of size u.
More informationBig Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
More informationAnalysis of Binary Search algorithm and Selection Sort algorithm
Analysis of Binary Search algorithm and Selection Sort algorithm In this section we shall take up two representative problems in computer science, work out the algorithms based on the best strategy to
More informationCS 2112 Spring 2014. 0 Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions
CS 2112 Spring 2014 Assignment 3 Data Structures and Web Filtering Due: March 4, 2014 11:59 PM Implementing spam blacklists and web filters requires matching candidate domain names and URLs very rapidly
More informationMulti-dimensional index structures Part I: motivation
Multi-dimensional index structures Part I: motivation 144 Motivation: Data Warehouse A definition A data warehouse is a repository of integrated enterprise data. A data warehouse is used specifically for
More informationEncoding Text with a Small Alphabet
Chapter 2 Encoding Text with a Small Alphabet Given the nature of the Internet, we can break the process of understanding how information is transmitted into two components. First, we have to figure out
More informationInformation Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay
Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding
More informationTables so far. set() get() delete() BST Average O(lg n) O(lg n) O(lg n) Worst O(n) O(n) O(n) RB Tree Average O(lg n) O(lg n) O(lg n)
Hash Tables Tables so far set() get() delete() BST Average O(lg n) O(lg n) O(lg n) Worst O(n) O(n) O(n) RB Tree Average O(lg n) O(lg n) O(lg n) Worst O(lg n) O(lg n) O(lg n) Table naïve array implementation
More informationThe Exponential Distribution
21 The Exponential Distribution From Discrete-Time to Continuous-Time: In Chapter 6 of the text we will be considering Markov processes in continuous time. In a sense, we already have a very good understanding
More informationMapReduce and the New Software Stack
20 Chapter 2 MapReduce and the New Software Stack Modern data-mining applications, often called big-data analysis, require us to manage immense amounts of data quickly. In many of these applications, the
More informationThe Advantages and Disadvantages of Network Computing Nodes
Big Data & Scripting storage networks and distributed file systems 1, 2, in the remainder we use networks of computing nodes to enable computations on even larger datasets for a computation, each node
More informationBig Data & Scripting Part II Streaming Algorithms
Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),
More information2.1. Inductive Reasoning EXAMPLE A
CONDENSED LESSON 2.1 Inductive Reasoning In this lesson you will Learn how inductive reasoning is used in science and mathematics Use inductive reasoning to make conjectures about sequences of numbers
More informationCuckoo Filter: Practically Better Than Bloom
Cuckoo Filter: Practically Better Than Bloom Bin Fan, David G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher Carnegie Mellon University, Intel Labs, Harvard University {binfan,dga}@cs.cmu.edu, michael.e.kaminsky@intel.com,
More informationCassandra A Decentralized, Structured Storage System
Cassandra A Decentralized, Structured Storage System Avinash Lakshman and Prashant Malik Facebook Published: April 2010, Volume 44, Issue 2 Communications of the ACM http://dl.acm.org/citation.cfm?id=1773922
More informationChapter 3. if 2 a i then location: = i. Page 40
Chapter 3 1. Describe an algorithm that takes a list of n integers a 1,a 2,,a n and finds the number of integers each greater than five in the list. Ans: procedure greaterthanfive(a 1,,a n : integers)
More informationDiscrete Math in Computer Science Homework 7 Solutions (Max Points: 80)
Discrete Math in Computer Science Homework 7 Solutions (Max Points: 80) CS 30, Winter 2016 by Prasad Jayanti 1. (10 points) Here is the famous Monty Hall Puzzle. Suppose you are on a game show, and you
More informationLecture 1: Course overview, circuits, and formulas
Lecture 1: Course overview, circuits, and formulas Topics in Complexity Theory and Pseudorandomness (Spring 2013) Rutgers University Swastik Kopparty Scribes: John Kim, Ben Lund 1 Course Information Swastik
More informationMODELING RANDOMNESS IN NETWORK TRAFFIC
MODELING RANDOMNESS IN NETWORK TRAFFIC - LAVANYA JOSE, INDEPENDENT WORK FALL 11 ADVISED BY PROF. MOSES CHARIKAR ABSTRACT. Sketches are randomized data structures that allow one to record properties of
More informationQuestion 1 Formatted: Formatted: Formatted: Formatted:
In many situations in life, we are presented with opportunities to evaluate probabilities of events occurring and make judgments and decisions from this information. In this paper, we will explore four
More informationBinomial random variables
Binomial and Poisson Random Variables Solutions STAT-UB.0103 Statistics for Business Control and Regression Models Binomial random variables 1. A certain coin has a 5% of landing heads, and a 75% chance
More informationWhy? A central concept in Computer Science. Algorithms are ubiquitous.
Analysis of Algorithms: A Brief Introduction Why? A central concept in Computer Science. Algorithms are ubiquitous. Using the Internet (sending email, transferring files, use of search engines, online
More informationLectures 5-6: Taylor Series
Math 1d Instructor: Padraic Bartlett Lectures 5-: Taylor Series Weeks 5- Caltech 213 1 Taylor Polynomials and Series As we saw in week 4, power series are remarkably nice objects to work with. In particular,
More informationLecture 6 Online and streaming algorithms for clustering
CSE 291: Unsupervised learning Spring 2008 Lecture 6 Online and streaming algorithms for clustering 6.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line
More informationCIS 631 Database Management Systems Sample Final Exam
CIS 631 Database Management Systems Sample Final Exam 1. (25 points) Match the items from the left column with those in the right and place the letters in the empty slots. k 1. Single-level index files
More informationLecture 4 Online and streaming algorithms for clustering
CSE 291: Geometric algorithms Spring 2013 Lecture 4 Online and streaming algorithms for clustering 4.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line
More informationLoad Balancing in MapReduce Based on Scalable Cardinality Estimates
Load Balancing in MapReduce Based on Scalable Cardinality Estimates Benjamin Gufler 1, Nikolaus Augsten #, Angelika Reiser 3, Alfons Kemper 4 Technische Universität München Boltzmannstraße 3, 85748 Garching
More informationEfficiency of algorithms. Algorithms. Efficiency of algorithms. Binary search and linear search. Best, worst and average case.
Algorithms Efficiency of algorithms Computational resources: time and space Best, worst and average case performance How to compare algorithms: machine-independent measure of efficiency Growth rate Complexity
More informationCounting Problems in Flash Storage Design
Flash Talk Counting Problems in Flash Storage Design Bongki Moon Department of Computer Science University of Arizona Tucson, AZ 85721, U.S.A. bkmoon@cs.arizona.edu NVRAMOS 09, Jeju, Korea, October 2009-1-
More informationCloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman 2 Intuition: solve the recursive equation: a page is important if important pages link to it. Technically, importance = the principal
More informationChapter 2 Data Storage
Chapter 2 22 CHAPTER 2. DATA STORAGE 2.1. THE MEMORY HIERARCHY 23 26 CHAPTER 2. DATA STORAGE main memory, yet is essentially random-access, with relatively small differences Figure 2.4: A typical
More informationHypertable Architecture Overview
WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for
More informationB490 Mining the Big Data. 0 Introduction
B490 Mining the Big Data 0 Introduction Qin Zhang 1-1 Data Mining What is Data Mining? A definition : Discovery of useful, possibly unexpected, patterns in data. 2-1 Data Mining What is Data Mining? A
More informationBig Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time
Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time Edward Bortnikov & Ronny Lempel Yahoo! Labs, Haifa Class Outline Link-based page importance measures Why
More informationPractical Survey on Hash Tables. Aurelian Țuțuianu
Practical Survey on Hash Tables Aurelian Țuțuianu In memoriam Mihai Pătraşcu (17 July 1982 5 June 2012) I have no intention to ever teach computer science. I want to teach the love for computer science,
More informationNimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff
Nimble Algorithms for Cloud Computing Ravi Kannan, Santosh Vempala and David Woodruff Cloud computing Data is distributed arbitrarily on many servers Parallel algorithms: time Streaming algorithms: sublinear
More informationChapter 13: Query Processing. Basic Steps in Query Processing
Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing
More informationChapter 20: Data Analysis
Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification
More informationAlgorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)
Algorithmic Aspects of Big Data Nikhil Bansal (TU Eindhoven) Algorithm design Algorithm: Set of steps to solve a problem (by a computer) Studied since 1950 s. Given a problem: Find (i) best solution (ii)
More informationTheorem (informal statement): There are no extendible methods in David Chalmers s sense unless P = NP.
Theorem (informal statement): There are no extendible methods in David Chalmers s sense unless P = NP. Explication: In his paper, The Singularity: A philosophical analysis, David Chalmers defines an extendible
More informationSo today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)
Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we
More informationData Structures for Big Data: Bloom Filter. Vinicius Vielmo Cogo Smalltalks, DI, FC/UL. October 16, 2014.
Data Structures for Big Data: Bloom Filter Vinicius Vielmo Cogo Smalltalks, DI, FC/UL. October 16, 2014. is relative is not defined by a specific number of TB, PB, EB is when it becomes big for you is
More informationRandom Fibonacci-type Sequences in Online Gambling
Random Fibonacci-type Sequences in Online Gambling Adam Biello, CJ Cacciatore, Logan Thomas Department of Mathematics CSUMS Advisor: Alfa Heryudono Department of Mathematics University of Massachusetts
More informationAn example of a computable
An example of a computable absolutely normal number Verónica Becher Santiago Figueira Abstract The first example of an absolutely normal number was given by Sierpinski in 96, twenty years before the concept
More information6.042/18.062J Mathematics for Computer Science December 12, 2006 Tom Leighton and Ronitt Rubinfeld. Random Walks
6.042/8.062J Mathematics for Comuter Science December 2, 2006 Tom Leighton and Ronitt Rubinfeld Lecture Notes Random Walks Gambler s Ruin Today we re going to talk about one-dimensional random walks. In
More informationDNS LOOKUP SYSTEM DATA STRUCTURES AND ALGORITHMS PROJECT REPORT
DNS LOOKUP SYSTEM DATA STRUCTURES AND ALGORITHMS PROJECT REPORT By GROUP Avadhut Gurjar Mohsin Patel Shraddha Pandhe Page 1 Contents 1. Introduction... 3 2. DNS Recursive Query Mechanism:...5 2.1. Client
More informationFind-The-Number. 1 Find-The-Number With Comps
Find-The-Number 1 Find-The-Number With Comps Consider the following two-person game, which we call Find-The-Number with Comps. Player A (for answerer) has a number x between 1 and 1000. Player Q (for questioner)
More informationApproximating functions by Taylor Polynomials.
Chapter 4 Approximating functions by Taylor Polynomials. 4.1 Linear Approximations We have already seen how to approximate a function using its tangent line. This was the key idea in Euler s method. If
More informationDatabase Security. Soon M. Chung Department of Computer Science and Engineering Wright State University schung@cs.wright.
Database Security Soon M. Chung Department of Computer Science and Engineering Wright State University schung@cs.wright.edu 937-775-5119 Goals of DB Security Integrity: Only authorized users should be
More informationFAWN - a Fast Array of Wimpy Nodes
University of Warsaw January 12, 2011 Outline Introduction 1 Introduction 2 3 4 5 Key issues Introduction Growing CPU vs. I/O gap Contemporary systems must serve millions of users Electricity consumed
More informationRethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
More informationOSTROWSKI FOR NUMBER FIELDS
OSTROWSKI FOR NUMBER FIELDS KEITH CONRAD Ostrowski classified the nontrivial absolute values on Q: up to equivalence, they are the usual (archimedean) absolute value and the p-adic absolute values for
More informationIMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS
Volume 2, No. 3, March 2011 Journal of Global Research in Computer Science RESEARCH PAPER Available Online at www.jgrcs.info IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE
More informationSMALL INDEX LARGE INDEX (SILT)
Wayne State University ECE 7650: Scalable and Secure Internet Services and Architecture SMALL INDEX LARGE INDEX (SILT) A Memory Efficient High Performance Key Value Store QA REPORT Instructor: Dr. Song
More informationNETWORK SECURITY: How do servers store passwords?
NETWORK SECURITY: How do servers store passwords? Servers avoid storing the passwords in plaintext on their servers to avoid possible intruders to gain all their users passwords. A hash of each password
More informationWho am I? Copyright 2014, Oracle and/or its affiliates. All rights reserved. 3
Oracle Database In-Memory Power the Real-Time Enterprise Saurabh K. Gupta Principal Technologist, Database Product Management Who am I? Principal Technologist, Database Product Management at Oracle Author
More informationMapReduce: Algorithm Design Patterns
Designing Algorithms for MapReduce MapReduce: Algorithm Design Patterns Need to adapt to a restricted model of computation Goals Scalability: adding machines will make the algo run faster Efficiency: resources
More informationKNOM Tutorial 2003. Internet Traffic Measurement and Analysis. Sue Bok Moon Dept. of Computer Science
KNOM Tutorial 2003 Internet Traffic Measurement and Analysis Sue Bok Moon Dept. of Computer Science Overview Definition of Traffic Matrix 4Traffic demand, delay, loss Applications of Traffic Matrix 4Engineering,
More informationEx. 2.1 (Davide Basilio Bartolini)
ECE 54: Elements of Information Theory, Fall 00 Homework Solutions Ex.. (Davide Basilio Bartolini) Text Coin Flips. A fair coin is flipped until the first head occurs. Let X denote the number of flips
More informationBloom Filters. Christian Antognini Trivadis AG Zürich, Switzerland
Bloom Filters Christian Antognini Trivadis AG Zürich, Switzerland Oracle Database uses bloom filters in various situations. Unfortunately, no information about their usage is available in Oracle documentation.
More informationPoint Biserial Correlation Tests
Chapter 807 Point Biserial Correlation Tests Introduction The point biserial correlation coefficient (ρ in this chapter) is the product-moment correlation calculated between a continuous random variable
More informationBig Data Analytics. Genoveva Vargas-Solar http://www.vargas-solar.com/big-data-analytics French Council of Scientific Research, LIG & LAFMIA Labs
1 Big Data Analytics Genoveva Vargas-Solar http://www.vargas-solar.com/big-data-analytics French Council of Scientific Research, LIG & LAFMIA Labs Montevideo, 22 nd November 4 th December, 2015 INFORMATIQUE
More informationLoad Balancing Between Heterogenous Computing Clusters
Load Balancing Between Heterogenous Computing Clusters Siu-Cheung Chau Dept. of Physics and Computing, Wilfrid Laurier University, Waterloo, Ontario, Canada, N2L 3C5 e-mail: schau@wlu.ca Ada Wai-Chee Fu
More informationAssignment 2: More MapReduce with Hadoop
Assignment 2: More MapReduce with Hadoop Jean-Pierre Lozi February 5, 2015 Provided files following URL: An archive that contains all files you will need for this assignment can be found at the http://sfu.ca/~jlozi/cmpt732/assignment2.tar.gz
More informationBinary Heaps * * * * * * * / / \ / \ / \ / \ / \ * * * * * * * * * * * / / \ / \ / / \ / \ * * * * * * * * * *
Binary Heaps A binary heap is another data structure. It implements a priority queue. Priority Queue has the following operations: isempty add (with priority) remove (highest priority) peek (at highest
More informationMath 55: Discrete Mathematics
Math 55: Discrete Mathematics UC Berkeley, Fall 2011 Homework # 5, due Wednesday, February 22 5.1.4 Let P (n) be the statement that 1 3 + 2 3 + + n 3 = (n(n + 1)/2) 2 for the positive integer n. a) What
More informationStatistics 104: Section 6!
Page 1 Statistics 104: Section 6! TF: Deirdre (say: Dear-dra) Bloome Email: dbloome@fas.harvard.edu Section Times Thursday 2pm-3pm in SC 109, Thursday 5pm-6pm in SC 705 Office Hours: Thursday 6pm-7pm SC
More informationKafka & Redis for Big Data Solutions
Kafka & Redis for Big Data Solutions Christopher Curtin Head of Technical Research @ChrisCurtin About Me 25+ years in technology Head of Technical Research at Silverpop, an IBM Company (14 + years at Silverpop)
More informationLoad Balancing. Load Balancing 1 / 24
Load Balancing Backtracking, branch & bound and alpha-beta pruning: how to assign work to idle processes without much communication? Additionally for alpha-beta pruning: implementing the young-brothers-wait
More informationMath/Stats 342: Solutions to Homework
Math/Stats 342: Solutions to Homework Steven Miller (sjm1@williams.edu) November 17, 2011 Abstract Below are solutions / sketches of solutions to the homework problems from Math/Stats 342: Probability
More informationPractice Problems for Homework #6. Normal distribution and Central Limit Theorem.
Practice Problems for Homework #6. Normal distribution and Central Limit Theorem. 1. Read Section 3.4.6 about the Normal distribution and Section 4.7 about the Central Limit Theorem. 2. Solve the practice
More informationRelational Database: Additional Operations on Relations; SQL
Relational Database: Additional Operations on Relations; SQL Greg Plaxton Theory in Programming Practice, Fall 2005 Department of Computer Science University of Texas at Austin Overview The course packet
More informationCommon Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein Falaki @mhfalaki hossein@databricks.com
Common Patterns and Pitfalls for Implementing Algorithms in Spark Hossein Falaki @mhfalaki hossein@databricks.com Challenges of numerical computation over big data When applying any algorithm to big data
More informationFactoring & Primality
Factoring & Primality Lecturer: Dimitris Papadopoulos In this lecture we will discuss the problem of integer factorization and primality testing, two problems that have been the focus of a great amount
More informationInteractive Machine Learning. Maria-Florina Balcan
Interactive Machine Learning Maria-Florina Balcan Machine Learning Image Classification Document Categorization Speech Recognition Protein Classification Branch Prediction Fraud Detection Spam Detection
More information(Refer Slide Time: 01.26)
Discrete Mathematical Structures Dr. Kamala Krithivasan Department of Computer Science and Engineering Indian Institute of Technology, Madras Lecture # 27 Pigeonhole Principle In the next few lectures
More informationLecture 1: Data Storage & Index
Lecture 1: Data Storage & Index R&G Chapter 8-11 Concurrency control Query Execution and Optimization Relational Operators File & Access Methods Buffer Management Disk Space Management Recovery Manager
More informationMapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
More informationWeek 1: Introduction to Online Learning
Week 1: Introduction to Online Learning 1 Introduction This is written based on Prediction, Learning, and Games (ISBN: 2184189 / -21-8418-9 Cesa-Bianchi, Nicolo; Lugosi, Gabor 1.1 A Gentle Start Consider
More informationBIG DATA, MAPREDUCE & HADOOP
BIG, MAPREDUCE & HADOOP LARGE SCALE DISTRIBUTED SYSTEMS By Jean-Pierre Lozi A tutorial for the LSDS class LARGE SCALE DISTRIBUTED SYSTEMS BIG, MAPREDUCE & HADOOP 1 OBJECTIVES OF THIS LAB SESSION The LSDS
More informationPhysical Data Organization
Physical Data Organization Database design using logical model of the database - appropriate level for users to focus on - user independence from implementation details Performance - other major factor
More informationIP Subnetting: Practical Subnet Design and Address Determination Example
IP Subnetting: Practical Subnet Design and Address Determination Example When educators ask students what they consider to be the most confusing aspect in learning about networking, many say that it is
More informationarxiv:1112.0829v1 [math.pr] 5 Dec 2011
How Not to Win a Million Dollars: A Counterexample to a Conjecture of L. Breiman Thomas P. Hayes arxiv:1112.0829v1 [math.pr] 5 Dec 2011 Abstract Consider a gambling game in which we are allowed to repeatedly
More informationBinomial random variables (Review)
Poisson / Empirical Rule Approximations / Hypergeometric Solutions STAT-UB.3 Statistics for Business Control and Regression Models Binomial random variables (Review. Suppose that you are rolling a die
More informationSession 7 Fractions and Decimals
Key Terms in This Session Session 7 Fractions and Decimals Previously Introduced prime number rational numbers New in This Session period repeating decimal terminating decimal Introduction In this session,
More informationNumbers 101: Cost and Value Over Time
The Anderson School at UCLA POL 2000-09 Numbers 101: Cost and Value Over Time Copyright 2000 by Richard P. Rumelt. We use the tool called discounting to compare money amounts received or paid at different
More informationSystems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop
More informationBounded Cost Algorithms for Multivalued Consensus Using Binary Consensus Instances
Bounded Cost Algorithms for Multivalued Consensus Using Binary Consensus Instances Jialin Zhang Tsinghua University zhanggl02@mails.tsinghua.edu.cn Wei Chen Microsoft Research Asia weic@microsoft.com Abstract
More informationStanford Math Circle: Sunday, May 9, 2010 Square-Triangular Numbers, Pell s Equation, and Continued Fractions
Stanford Math Circle: Sunday, May 9, 00 Square-Triangular Numbers, Pell s Equation, and Continued Fractions Recall that triangular numbers are numbers of the form T m = numbers that can be arranged in
More informationLecture 3: Finding integer solutions to systems of linear equations
Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture
More information1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++
Answer the following 1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++ 2) Which data structure is needed to convert infix notations to postfix notations? Stack 3) The
More informationData Structures. Algorithm Performance and Big O Analysis
Data Structures Algorithm Performance and Big O Analysis What s an Algorithm? a clearly specified set of instructions to be followed to solve a problem. In essence: A computer program. In detail: Defined
More information