# Big Data & Scripting Part II Streaming Algorithms


## Transcription


### a note on sampling and filtering

- sampling: (randomly) choose a representative subset
- filtering: given some criterion (e.g. membership in a set), retain only the elements matching that criterion
- example scenario: stream of requests (user, request)
  - sampling requests is straightforward (e.g. to find which pages are accessed most frequently)
  - analyzing the distribution of frequencies is more complicated: we want to know how many queries are repeated x times (for all x)

### sampling and filtering example

n = 200,000 events, m = 40,000 different requests, uniform distribution

[Figure: all queries compared with a sample, plotted by id]

### sampling and filtering example

same dataset, but plotted as frequency vs. number of queries with this frequency

[Figure: "all queries by frequency" and "sample by frequency"; axes: frequency, number of queries with this frequency]

completely different distributions due to sampling

### sampling and filtering example

same dataset, again plotted as frequency vs. number of queries with this frequency; this time the sample is selected by a fixed subset of ids

[Figure: "all queries by frequency" and "corrected 10% sample by frequency"; axes: frequency, number of queries with this frequency]
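The effect described on these slides can be reproduced with a small simulation. The following is only a sketch under stated assumptions: the stream is scaled down from the slide's numbers, ids are plain integers, the "fixed subset of ids" is taken to be the ids divisible by 10, and the helper name `frequency_histogram` is made up for illustration.

```python
import random
from collections import Counter

def frequency_histogram(events):
    """Map frequency x -> number of distinct ids that occur exactly x times."""
    per_id = Counter(events)            # id -> number of occurrences
    return Counter(per_id.values())     # frequency -> number of ids

rng = random.Random(0)                  # fixed seed, for repeatability only
n_events, n_ids = 20_000, 4_000         # scaled-down stand-ins for n and m
events = [rng.randrange(n_ids) for _ in range(n_events)]

# (a) sample 10% of the events: each event is kept independently
event_sample = [e for e in events if rng.random() < 0.1]

# (b) sample 10% of the ids: keep *all* events of a fixed subset of ids
id_sample = [e for e in events if e % 10 == 0]

full = frequency_histogram(events)
by_event = frequency_histogram(event_sample)
by_id = frequency_histogram(id_sample)

for name, hist in [("all", full), ("event sample", by_event), ("id sample", by_id)]:
    print(name, sorted(hist.items())[:5])
```

Sampling by id keeps every occurrence of a selected id, so per-id frequencies are unchanged and the histogram only shrinks in height. Sampling individual events thins out every id's occurrences and shifts the whole histogram towards low frequencies.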

### Histograms and Frequency Skews

### stream and histogram

consider the following input:

[Figure: objects/buckets plotted over time]

- as time/stream progresses, data points come in, e.g. users issue requests
- data points are distinguished by some id or bucket (from hashing)
- some are seen more often (e.g. 4), some less often (e.g. 1)
- e.g. user 4 sends requests with high frequency, user 1 sends only one request
- this is highly valuable information for an analysis

### stream and histogram

[Figure: objects/buckets plotted over time]

to analyze these distributions, histograms are helpful:

[Figure: histogram of frequency per object]

### comparing histograms - different distributions

an example of two different streams of observations:

[Figure: two histograms of frequency per object]

- both have an equal number of data points (10,000) and distinct objects (60)
- but the objects have different probabilities of being observed
- sorting objects by frequencies makes the difference more obvious:

[Figure: the same two histograms with objects sorted by frequency]

### the plan

- information about the distribution of observations is crucial for many applications
- knowing the complete, exact histogram would be helpful, but is often not possible due to the large number of distinct objects
- workaround: characterize the histogram without knowing the complete picture
  - characteristic properties are easier to determine
  - analogous to descriptions of distributions on $\mathbb{R}$

### characterizing distributions

- object frequency $m_i$: frequency of object $i$
- number of distinct objects seen so far: $\sum_i (m_i)^0$
- total number of objects seen so far: $\sum_i (m_i)^1 = \sum_i m_i$
- generalization: $M_k = \sum_i (m_i)^k$, the $k$th frequency moment
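To make the definition concrete, here is a minimal sketch that computes the moments exactly for a small stream that fits in memory. The function name `frequency_moment` and the toy stream are illustrative assumptions; for real streams the exact counts are precisely what cannot be stored.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment M_k = sum_i (m_i)^k of a finite stream."""
    counts = Counter(stream)                  # m_i for every distinct object i
    return sum(m ** k for m in counts.values())

stream = list("abacabad")                     # a: 4, b: 2, c: 1, d: 1
m0 = frequency_moment(stream, 0)              # number of distinct objects
m1 = frequency_moment(stream, 1)              # total number of objects
m2 = frequency_moment(stream, 2)              # second moment
print(m0, m1, m2)                             # prints: 4 8 22
```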

### $M_2$, the second moment

what we have so far:

- $M_0$: Flajolet-Martin algorithm from last lecture
- $M_1$: counting
- combination: average frequency $M_1/M_0$

next: estimate $M_2 = \sum_i m_i^2$

### $M_2$, the second moment

[Figure: two histograms over objects, each annotated with its value of $M_2$]

Motivation:

- $M_2$ describes the skewness of a distribution
- smaller $M_2$: less skewed distribution
- related to the Gini index (surprise index)
- used to limit approximation errors and for query optimization in database systems

### $M_2$ and Var(X)

- the variance describes the distribution of values
- $M_2$ describes the distribution of their frequencies
- $M_2$ is comparable to the variance of the frequencies: $\mathrm{Var}(\{m_i\}) = \frac{1}{N} \sum_i \bigl(m_i - \mu(\{m_i\})\bigr)^2$
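The connection can be written out explicitly. The following short derivation assumes that $N$ in the variance formula denotes the number of distinct objects (that is, $N = M_0$) and that $\mu = M_1/M_0$ is the average frequency; the slide does not state this, so treat it as one possible reading.

$$
\mathrm{Var}(\{m_i\})
= \frac{1}{M_0} \sum_i (m_i - \mu)^2
= \frac{1}{M_0} \sum_i m_i^2 - 2\mu \cdot \frac{1}{M_0} \sum_i m_i + \mu^2
= \frac{M_2}{M_0} - \left(\frac{M_1}{M_0}\right)^2
$$

Under this reading, the variance of the frequencies is determined by the three moments $M_0$, $M_1$ and $M_2$, so for fixed $M_0$ and $M_1$ a larger $M_2$ means a larger variance.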

### $M_2$, the second moment: approximation

- storing and counting all distinct objects is impossible
- approximation by the Alon-Matias-Szegedy algorithm [1]

algorithm:

- $N$ observations in the stream
- choose $k$ random positions $p_j \in \{1, \dots, N\}$
- when reaching position $p_j$:
  - store the object at this position
  - start counting occurrences of this object in $m_j$
- estimate: $M_2 \approx \frac{N}{k} \sum_{j=1}^{k} (2 m_j - 1)$

[1] Alon, N.; Matias, Y.; Szegedy, M.: The space complexity of approximating the frequency moments, 1999
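The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a tuned implementation: the function name `ams_second_moment`, the `seed` parameter and the choice to draw positions with replacement are assumptions, and the stream is a list so that its length $N$ is known in advance.

```python
import random

def ams_second_moment(stream, k, seed=None):
    """Estimate M_2 from k uniformly chosen positions of a stream of known length."""
    rng = random.Random(seed)
    n = len(stream)
    # Choose k positions (0-based, with replacement) and remember multiplicities.
    starts = {}
    for _ in range(k):
        p = rng.randrange(n)
        starts[p] = starts.get(p, 0) + 1

    tracked = []                      # entries: [object, count m_j, multiplicity]
    for pos, x in enumerate(stream):  # single pass over the stream
        for t in tracked:
            if t[0] == x:
                t[1] += 1             # another occurrence of a tracked object
        if pos in starts:
            tracked.append([x, 1, starts[pos]])   # start counting at this position

    total = sum(mult * (2 * count - 1) for _, count, mult in tracked)
    return n * total / k

stream = list("cecfaegffbbcgbaafdae")
print(ams_second_moment(stream, k=4, seed=1))
```

Only the tracked objects and their counters are stored, never the full table of frequencies.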

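### $M_2$, the second moment: example

stream: c e c f a e g f f b b c g b a a f d a e ($N = 20$)

random positions: 3, 7, 14, 5

- position 3: encounter c, counting results in 2
- position 7: encounter g, counting results in 2
- position 14: encounter b, counting results in 1
- position 5: encounter a, counting results in 4

estimate: $M_2 \approx \frac{20}{4}\,[\,2 \cdot (2 \cdot 2 - 1) + (2 \cdot 1 - 1) + (2 \cdot 4 - 1)\,] = 5 \cdot 14 = 70$

true value: $M_2 = 4^2 + 3^2 + 3^2 + 1^2 + 3^2 + 4^2 + 2^2 = 64$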
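The numbers of this example can be checked directly. The snippet below is a small verification sketch that uses the four positions from the slide instead of random ones; variable names are illustrative.

```python
from collections import Counter

stream = "c e c f a e g f f b b c g b a a f d a e".split()
positions = [3, 7, 14, 5]              # 1-based positions from the example

n, k = len(stream), len(positions)
counters = []
for p in positions:
    obj = stream[p - 1]
    # count occurrences of obj from position p to the end of the stream
    counters.append(stream[p - 1:].count(obj))

estimate = n * sum(2 * m - 1 for m in counters) / k
true_m2 = sum(c * c for c in Counter(stream).values())
print(counters, estimate, true_m2)     # prints: [2, 2, 1, 4] 70.0 64
```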

### $M_2$, the second moment: summary

- the algorithm is simple to implement
- it needs to store only the $k$ counters
- it gets more precise with larger $k$; proof idea:
  - the expected value of $2 m_j - 1$ for each counter is the fraction $M_2/N$ of $M_2$
  - the average over $k$ counters, scaled by $N$, approaches $M_2$
- problem: $N$ may not be known in the beginning
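The proof idea can be spelled out in one line. As a sketch, assume a single position $p$ is chosen uniformly from $\{1, \dots, N\}$ and let $m$ be the resulting counter. For an object $i$ with frequency $m_i$, choosing its $c$-th occurrence counted from the end of the stream yields $m = c$, and each of these positions has probability $1/N$:

$$
\mathbb{E}\bigl[N (2m - 1)\bigr]
= \frac{1}{N} \sum_i \sum_{c=1}^{m_i} N (2c - 1)
= \sum_i m_i^2
= M_2,
\qquad \text{using } \sum_{c=1}^{m} (2c - 1) = m^2 .
$$

Averaging $k$ independent copies keeps the expected value at $M_2$ and reduces the variance of the estimate.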

### approximating $M_2$ with unknown stream length

- the stream may be of unknown length or unlimited
- still, each position must be chosen uniformly at random from $\{1, \dots, N\}$

solution:

- keep counts for $k$ objects, beginning with the first $k$
- when the object at position $p > k$ is processed: choose it with probability $k/p$ (that is, $k/(n+1)$ once $n$ objects have been seen)
- if it is chosen, drop one existing element (each with equal probability) and start counting the new object
- result: each position is chosen with equal probability
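The replacement scheme is a form of reservoir sampling and can be sketched as follows. This is an illustrative sketch, not the lecture's code: the name `ams_second_moment_unbounded` and the `seed` parameter are assumptions, and the estimate is returned once the (here finite) iterator is exhausted, although it could be read off at any time.

```python
import random

def ams_second_moment_unbounded(stream, k, seed=None):
    """Estimate M_2 in one pass without knowing the stream length in advance."""
    rng = random.Random(seed)
    slots = []                        # each slot: [object, count since its position]
    n = 0                             # number of objects seen so far
    for x in stream:
        n += 1
        for s in slots:
            if s[0] == x:
                s[1] += 1             # another occurrence of a tracked object
        if n <= k:
            slots.append([x, 1])      # the first k positions are always kept
        elif rng.random() < k / n:    # position n is chosen with probability k/n
            slots[rng.randrange(k)] = [x, 1]   # replace a uniformly chosen slot
    if not slots:
        return 0.0
    return n * sum(2 * count - 1 for _, count in slots) / len(slots)

print(ams_second_moment_unbounded(iter("cecfaegffbbcgbaafdae"), k=4, seed=0))
```

Because the function only iterates over its input, it also accepts generators whose length is never materialized.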

### clustering data streams

### clustering data streams: the problem

- many formulations of the clustering problem are possible
  - wide range of applications, strong variance in preconditions and objective function
- common ground:
  - objects are connected by a relation
  - identify groups of similar objects with respect to the relation
  - the problem is intractable (NP-hard)
- some basic questions:
  - what kind of relation (e.g. binary, distance, similarity)?
  - can objects have a mean value (continuous space)?
  - what is a good cluster (objective function)?
  - is there a possibility of overlapping clusters?

### clustering data streams: STREAM

in the following: a single example problem and a single algorithm

- k-median on a data stream
- in one pass
- with guaranteed approximation quality
- algorithm: STREAM

Guha, Mishra, Motwani, O'Callaghan: Clustering Data Streams, 2000

### clustering data streams: the k-median problem

input:

- objects $X = \{x_i : i = 1, \dots, N\}$
- distance $d : X \times X \to \mathbb{R}$
- every $x_i$ is seen once, in arbitrary order ($i = 1, \dots, N$)
- $k$: number of clusters to find

objective:

- identify $k$ elements $m_1, \dots, m_k \in X$ (cluster centers)
- let $N(m_j) = \{x_i \in X : j = \arg\min_{l \in \{1, \dots, k\}} d(x_i, m_l)\}$, i.e. all $x_i$ for which $m_j$ is the nearest center
- minimize $C(\{m_1, \dots, m_k\}) = \sum_{j=1}^{k} \sum_{x_i \in N(m_j)} d(x_i, m_j)$
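The objective can be stated as code in a few lines. This is a sketch with assumed names (`kmedian_cost`, `assign`) and a toy one-dimensional example with absolute difference as the distance; ties between equally near centers go to the center with the lowest index, a choice the definition leaves open.

```python
def kmedian_cost(points, centers, dist):
    """C({m_1..m_k}): sum over all points of the distance to the nearest center."""
    return sum(min(dist(x, m) for m in centers) for x in points)

def assign(points, centers, dist):
    """N(m_j) for every j: the points whose nearest center is m_j."""
    clusters = {j: [] for j in range(len(centers))}
    for x in points:
        j = min(range(len(centers)), key=lambda l: dist(x, centers[l]))
        clusters[j].append(x)
    return clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0]

def dist(a, b):
    return abs(a - b)

centers = [2.0, 10.0]
print(kmedian_cost(points, centers, dist))    # prints: 3.0
print(assign(points, centers, dist))
```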

### clustering data streams: approximating k-median

- for small problem instances, k-median can be fixed parameter approximated
- fixed parameter approximation: $C_{approx} \le a \cdot C_{opt}$ (the approximation is at most a factor $a$ worse than the optimal solution, for fixed $a$)
- this approximation is useful to approximate larger instances

approximation (idea):

- k-medians can be stated as an integer program $P_I$
- this program can be relaxed to a linear program $P_L$
- the solution of $P_L$ can be rounded to a solution of $P_I$
- linear programs can be solved efficiently

### clustering data streams: weighted k-medians

extending k-medians with weights:

- k-medians with weighted samples, $w : X \to \mathbb{R}_{>0}$
- the distance of objects to their centers is multiplied by the weight: $C(\{m_1, \dots, m_k\}) = \sum_{j=1}^{k} \sum_{x_i \in N(m_j)} w(x_i) \, d(x_i, m_j)$
- k-medians is the special case with unit weights
- weighted k-medians can be approximated similarly to k-medians:
  - the algorithm can only be applied to small instances
  - use it to solve small sub-problems

in the following, use the procedure wkm():

- input: objects, weights, $k$
- output: $k$ weighted centers
- runtime: $O(n^2)$

### first step - clustering with low memory

approach: divide and conquer

Small-Space(X)

1. divide $X$ into $l$ disjoint subsets $X_1, \dots, X_l$
2. cluster each $X_i$ individually into $k$ clusters
3. result: $X'$, a set of $lk$ cluster centers
4. cluster $X'$, using for each $c \in X'$ the size $|N(c)|$ as weight

- step 2 can be solved with a constant factor approximation: the solution is at most $b$ times worse than the optimum
- step 4 can be solved with a constant factor approximation, not worse than $c$ times the optimum
- result: a constant factor approximation from the partial solutions and their combination
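The four steps of Small-Space can be sketched as follows. Several things here are assumptions made for illustration: `wkm` is implemented as an exhaustive search over candidate centers (exact but exponential, standing in for the approximation procedure the slides refer to), the subsets are consecutive chunks of the input, and the example data are made up.

```python
from itertools import combinations

def wkm(points, weights, k, dist):
    """Weighted k-median by exhaustive search; a stand-in for the approximation."""
    k = min(k, len(points))
    def cost(centers):
        return sum(w * min(dist(x, c) for c in centers)
                   for x, w in zip(points, weights))
    return list(min(combinations(points, k), key=cost))

def small_space(points, k, l, dist):
    """Divide-and-conquer clustering following the four Small-Space steps."""
    chunk = max(1, -(-len(points) // l))           # ceil(len / l)
    centers, weights = [], []
    for start in range(0, len(points), chunk):     # step 1: l disjoint subsets
        part = points[start:start + chunk]
        local = wkm(part, [1] * len(part), k, dist)    # step 2: cluster X_i
        sizes = [0] * len(local)
        for x in part:                             # |N(c)| for each local center
            j = min(range(len(local)), key=lambda i: dist(x, local[i]))
            sizes[j] += 1
        centers += local                           # step 3: collect X'
        weights += sizes
    return wkm(centers, weights, k, dist)          # step 4: weighted clustering

def dist(a, b):
    return abs(a - b)

data = [1, 2, 3, 10, 11, 12, 20, 21, 22, 1.5, 10.5, 20.5]
print(sorted(small_space(data, k=3, l=3, dist=dist)))
```

Only one subset and the growing list of weighted centers are held at a time, which is the memory saving the slide is after.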

### extending to a solution

Small-Space(X)

1. divide $X$ into $l$ disjoint subsets $X_1, \dots, X_l$
2. cluster each $X_i$ individually into $O(k)$ clusters
3. result: $X'$, a set of $O(lk)$ cluster centers
4. cluster $X'$, using for each $c \in X'$ the size $|N(c)|$ as weight

- constant factor approximation
- needs to cluster $X_i$: memory problem 1, size of subsets versus $l$
- needs to cluster $X'$: memory problem 2, clustering $O(lk)$ elements
