Big Data & Scripting Part II Streaming Algorithms


 Grant Morgan
 4 years ago
Transcription
1 Big Data & Scripting Part II: Streaming Algorithms
2 A note on sampling and filtering
  sampling: (randomly) choose a representative subset
  filtering: given some criterion (e.g. membership in a set), retain only the elements matching that criterion
  example scenario: stream of requests (user, request)
  sampling requests is straightforward (e.g. to find which pages are accessed most frequently)
  analyzing the distribution of frequencies is more complicated: we want to know how many queries are repeated x times (for all x)
3 Sampling and filtering: example
  n = 200,000 events, m = 40,000 different requests, uniform distribution
  [Figure: frequency per request id, for all queries and for the sample]
4 Sampling and filtering: example
  same dataset, but plotted as frequency vs. number of queries with this frequency
  [Figure: number of queries per frequency, for all queries and for the sample]
  the two distributions are completely different due to sampling
5 Sampling and filtering: example
  same dataset, again plotted as frequency vs. number of queries with this frequency
  this time the sample is selected by a fixed subset of ids
  [Figure: number of queries per frequency, for all queries and for the corrected 10% sample]
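The difference between the two sampling strategies can be reproduced with a short Python sketch. It is only an illustration under assumptions: request ids are plain integers, the fixed id subset is picked with `id % 10 == 0` as a stand-in for a hash function, and the helper names are hypothetical.

```python
import random
from collections import Counter

def frequency_histogram(events):
    """Map frequency x -> number of distinct ids that occur exactly x times."""
    per_id = Counter(events)
    return Counter(per_id.values())

def sample_events(events, rate, rng):
    """Event sampling: keep each event independently with probability `rate`."""
    return [e for e in events if rng.random() < rate]

def sample_ids(events, buckets, keep):
    """Id sampling: keep every event whose id falls into a fixed subset of ids.
    `e % buckets` stands in for a hash of the id (an assumption of this sketch)."""
    return [e for e in events if e % buckets < keep]

def mean_frequency(hist):
    """Average number of occurrences per distinct id in a histogram."""
    return sum(x * c for x, c in hist.items()) / sum(hist.values())

rng = random.Random(0)
n, m = 200_000, 40_000                      # sizes taken from the example slide
events = [rng.randrange(m) for _ in range(n)]

full = frequency_histogram(events)
by_event = frequency_histogram(sample_events(events, 0.1, rng))
by_id = frequency_histogram(sample_ids(events, 10, 1))

print("all queries:", mean_frequency(full))
print("event sample:", mean_frequency(by_event))
print("id sample:", mean_frequency(by_id))
```

Sampling events thins out every id, so the frequencies in the sample shrink. Sampling by id keeps all events of the selected ids, so the shape of the frequency distribution is preserved.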
6 Histograms and Frequency Skews
7 Stream and histogram
  consider the following input:
  [Figure: objects/buckets over time]
  as time (the stream) progresses, data points come in, e.g. users issue requests
  data points are distinguished by some id or bucket (from hashing)
  some are seen more often (e.g. 4), some less often (e.g. 1)
  e.g. user 4 sends requests with high frequency, user 1 sends only one request
  this is highly valuable information for an analysis
8 Stream and histogram
  [Figure: objects/buckets over time]
  to analyze these distributions, histograms are helpful:
  [Figure: histogram of frequency per object]
9 Comparing histograms: different distributions
  an example of two different streams of observations:
  [Figure: two histograms of frequency per object]
  both have an equal number of data points (10,000) and distinct objects (60)
  but the objects have different probabilities of being observed
  sorting the objects by frequency makes the difference more obvious:
  [Figure: the same two histograms with objects sorted by frequency]
10 The plan
  information about the distribution of observations is crucial for many applications
  knowing the complete, exact histogram would be helpful, but is often not possible due to the large number of distinct objects
  workaround: characterize the histogram without knowing the complete picture
  characteristic properties are easier to determine
  analogous to descriptions of distributions on R
11 Characterizing distributions
  $m_i$: frequency of object i
  number of distinct objects seen so far: $\sum_i (m_i)^0$
  total number of objects seen so far: $\sum_i (m_i)^1 = \sum_i m_i$
  generalization: $M_k = \sum_i (m_i)^k$, the kth moment
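As a reference point, the moments can be computed exactly when the stream is small enough to count every object. The following is a minimal Python sketch; the function name `moment` is chosen here for illustration only.

```python
from collections import Counter

def moment(stream, k):
    """Exact k-th frequency moment: sum over distinct objects of (frequency ** k)."""
    counts = Counter(stream)            # m_i for every distinct object i
    return sum(m_i ** k for m_i in counts.values())

stream = "abcab"                        # a: 2, b: 2, c: 1
print(moment(stream, 0))                # number of distinct objects
print(moment(stream, 1))                # total number of objects
print(moment(stream, 2))                # second moment
```

This exact computation needs one counter per distinct object, which is precisely what becomes infeasible on large streams.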
12 $M_2$, the second moment
  what we have so far:
    $M_0$: Flajolet-Martin algorithm from last lecture
    $M_1$: counting
    combination: average frequency $M_1 / M_0$
  next: estimate $M_2 = \sum_i m_i^2$
13 $M_2$, the second moment
  [Figure: two histograms over objects, each annotated with its value of $M_2$]
  Motivation:
    $M_2$ describes the skewness of a distribution
    smaller $M_2$: less skewed distribution
    related to the Gini index (surprise index)
    used to limit approximation errors and for query optimization in database systems
14 $M_2$ and Var(X)
  variance describes the distribution of values
  $M_2$ describes the distribution of their frequencies
  $M_2$ is comparable to the variance of the frequencies:
  $\mathrm{Var}(\{m_i\}) = \frac{1}{N} \sum_i \left(m_i - \mu(\{m_i\})\right)^2$
15 $M_2$, the second moment: approximation
  storing and counting all distinct objects is impossible
  approximation by the Alon-Matias-Szegedy algorithm [1]
  algorithm:
    N observations in the stream
    choose k random positions $p_j \in \{1, \ldots, N\}$
    when reaching position $p_j$: store the object at that position and start counting occurrences of this object in $m_j$
    estimate: $M_2 \approx \frac{N}{k} \sum_{j=1}^{k} (2 m_j - 1)$
  [1] Alon, N.; Matias, Y.; Szegedy, M.: The space complexity of approximating the frequency moments, 1999
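16 $M_2$, the second moment: example
  stream: c e c f a e g f f b b c g b a a f d a e (N = 20)
  random positions: 3, 7, 14, 5
    position 3: encounter c, counting results in 2
    position 7: encounter g, counting results in 2
    position 14: encounter b, counting results in 1
    position 5: encounter a, counting results in 4
  estimate: $M_2 \approx \frac{20}{4} \left[ 2 \cdot (2 \cdot 2 - 1) + (2 \cdot 1 - 1) + (2 \cdot 4 - 1) \right] = 5 \cdot 14 = 70$
  true value: $M_2 = 4^2 + 3^2 + 3^2 + 1^2 + 3^2 + 4^2 + 2^2 = 64$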
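The worked example can be replayed in Python. This is a sketch under the assumption that the stream length is known up front and the positions are given as 1-based indices; the function names are illustrative and not taken from the lecture.

```python
from collections import Counter

def ams_second_moment(stream, positions):
    """Alon-Matias-Szegedy style estimate of M_2 for fixed 1-based positions."""
    n = len(stream)
    chosen = set(positions)
    counters = []                       # one [object, count] pair per chosen position
    for p, obj in enumerate(stream, start=1):
        for counter in counters:        # count later occurrences of tracked objects
            if counter[0] == obj:
                counter[1] += 1
        if p in chosen:                 # start a new counter at this position
            counters.append([obj, 1])
    k = len(counters)
    return n / k * sum(2 * m - 1 for _, m in counters)

def exact_second_moment(stream):
    """Exact M_2, feasible only when all distinct objects can be counted."""
    return sum(m * m for m in Counter(stream).values())

stream = "cecfaegffbbcgbaafdae"         # the 20 observations from the example
print(ams_second_moment(stream, [3, 7, 14, 5]))   # 70.0
print(exact_second_moment(stream))                # 64
```

Only k counters are kept, regardless of how many distinct objects the stream contains.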
17 $M_2$, the second moment: summary
  the algorithm is simple to implement
  it needs to store only the k counters
  it gets more precise with larger k
  proof idea: the expected value of each term $(2 m_j - 1)$ is $M_2 / N$, a fraction of $M_2$, so the scaled average over the k counters approaches $M_2$
  problem: N may not be known in the beginning
18 Approximating $M_2$ with unknown stream length
  the stream may be of unknown length or unlimited
  still, each position must be chosen uniformly at random from $\{1, \ldots, N\}$
  solution:
    keep counts for k objects, beginning with the first k
    when the next object arrives after $p \geq k$ objects have been seen: choose it with probability $k/(p + 1)$
    if it is chosen, drop one existing element (each with equal probability) and start a new count for the chosen object
    this way each position is chosen with equal probability
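The replacement scheme is a form of reservoir sampling and can be sketched in Python as follows. The function name and the explicit `rng` parameter are assumptions of this sketch; `seen` counts the current object as well, so `k / seen` corresponds to the probability $k/(p + 1)$.

```python
import random

def ams_unknown_length(stream, k, rng):
    """Estimate M_2 in one pass without knowing the stream length in advance."""
    counters = []                       # up to k [object, count] pairs
    seen = 0                            # objects processed so far, including the current one
    for obj in stream:
        seen += 1
        for counter in counters:        # update counts of tracked objects
            if counter[0] == obj:
                counter[1] += 1
        if len(counters) < k:           # the first k positions are always kept
            counters.append([obj, 1])
        elif rng.random() < k / seen:   # keep this position with probability k / seen
            counters[rng.randrange(k)] = [obj, 1]   # evict a uniformly chosen counter
    if not counters:
        return 0.0
    return seen / len(counters) * sum(2 * m - 1 for _, m in counters)

rng = random.Random(42)
runs = [ams_unknown_length("cecfaegffbbcgbaafdae", 4, rng) for _ in range(2000)]
print(sum(runs) / len(runs))            # average over many runs, close to the true M_2 = 64
```

A single run can be far off; averaging several independent runs, or using a larger k, reduces the error.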
19 Clustering Data Streams
20 Clustering data streams: the problem
  many formulations of the clustering problem are possible
  wide range of applications, strong variance in preconditions and objective functions
  common ground:
    objects are connected by a relation
    identify groups of similar objects with respect to this relation
    the problem is intractable (NP-hard)
  some basic questions:
    what kind of relation (e.g. binary, distance, similarity)?
    can objects have a mean value (continuous space)?
    what is a good cluster (objective function)?
    are overlapping clusters possible?
21 Clustering data streams: STREAM
  in the following: a single example problem and a single algorithm
  k-median on a data stream
  in one pass
  with guaranteed approximation quality
  algorithm: STREAM
  Guha, Mishra, Motwani, O'Callaghan: Clustering Data Streams, 2000
22 Clustering data streams: the k-median problem
  input:
    objects $X = \{x_i : i = 1, \ldots, N\}$
    distance $d : X \times X \to \mathbb{R}$
    every $x_i$ is seen once, in arbitrary order ($i = 1, \ldots, N$)
    k: number of clusters to find
  objective:
    identify k elements $m_1, \ldots, m_k \in X$ (cluster centers)
    let $N(m_j) = \{x_i \in X : j = \arg\min_{l \in \{1, \ldots, k\}} d(x_i, m_l)\}$, i.e. all $x_i$ for which $m_j$ is the nearest center
    minimize $C(\{m_1, \ldots, m_k\}) = \sum_{j=1}^{k} \sum_{x_i \in N(m_j)} d(x_i, m_j)$
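The objective can be written down directly in Python. This is a minimal sketch for a given set of centers; the function names and the one-dimensional example data are chosen here for illustration.

```python
def assign(points, centers, dist):
    """Compute N(m_j): group every point under its nearest center."""
    clusters = {m: [] for m in centers}
    for x in points:
        nearest = min(centers, key=lambda m: dist(x, m))
        clusters[nearest].append(x)
    return clusters

def kmedian_cost(points, centers, dist):
    """C({m_1..m_k}): sum of the distances of all points to their nearest center."""
    return sum(min(dist(x, m) for m in centers) for x in points)

def absolute_difference(a, b):
    """Distance on the real line, used only for this example."""
    return abs(a - b)

points = [0, 1, 2, 10, 11]
centers = [1, 10]
print(assign(points, centers, absolute_difference))        # {1: [0, 1, 2], 10: [10, 11]}
print(kmedian_cost(points, centers, absolute_difference))  # 3
```

Evaluating the cost for given centers is cheap; the hard part of k-median is choosing the centers.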
23 Clustering data streams: approximating k-median
  for small problem instances, k-median can be approximated within a fixed factor
  fixed parameter approximation: $C_{approx} \leq a \cdot C_{opt}$ (the approximation is at most a factor a worse than the optimal solution, for a fixed a)
  this approximation is useful for approximating larger instances
  approximation (idea):
    k-median can be stated as an integer program $P_I$
    this program can be relaxed to a linear program $P_L$
    a solution of $P_L$ can be rounded to a solution of $P_I$
    linear programs can be solved efficiently
24 Clustering data streams: weighted k-median
  extending k-median with weights:
    k-median with weighted samples, $w : X \to \mathbb{R}_{>0}$
    the distance of objects to their centers is multiplied by the weight:
    $C(\{m_1, \ldots, m_k\}) = \sum_{j=1}^{k} \sum_{x_i \in N(m_j)} w(x_i) \, d(x_i, m_j)$
    k-median is the special case with unit weights
  weighted k-median can be approximated similarly to k-median:
    the algorithm can only be applied to small instances
    use it to solve small subproblems
  in the following, use the procedure wkm():
    input: objects, weights, k
    output: k weighted centers
    runtime: $O(n^2)$
25 First step: clustering with low memory
  approach: divide and conquer
  SmallSpace(X):
    1. divide X into l disjoint subsets $X_1, \ldots, X_l$
    2. cluster each $X_i$ individually into k clusters
    3. result: $X'$, a set of lk cluster centers
    4. cluster $X'$, using for each $c \in X'$ the size $|N(c)|$ as weight
  step 2 can be solved with a constant factor approximation: the solution is at most b times worse than the optimum
  step 4 can be solved with a constant factor approximation: not worse than c times the optimum
  result: a constant factor approximation from the partial solutions and their combination
26 Extending to a solution
  SmallSpace(X):
    1. divide X into l disjoint subsets $X_1, \ldots, X_l$
    2. cluster each $X_i$ individually into O(k) clusters
    3. result: $X'$, a set of O(lk) cluster centers
    4. cluster $X'$, using for each $c \in X'$ the size $|N(c)|$ as weight
  this yields a constant factor approximation, but:
    it needs to cluster each $X_i$: memory problem 1, size of the subsets versus l
    it needs to cluster $X'$: memory problem 2, clustering O(lk) elements
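The divide-and-conquer structure of SmallSpace can be sketched in Python. Note the assumptions: `wkm` here is a brute-force stand-in that tries every k-subset of the input as centers, not the LP-based approximation referred to in the slides, and the names `wkm` and `small_space` as well as the chunking by consecutive slices are choices made for this sketch.

```python
from itertools import combinations

def wkm(points, weights, k, dist):
    """Brute-force weighted k-median (stand-in): returns centers and their weights."""
    k = min(k, len(points))

    def cost(center_idx):
        return sum(w * min(dist(x, points[j]) for j in center_idx)
                   for x, w in zip(points, weights))

    best = min(combinations(range(len(points)), k), key=cost)
    centers = [points[j] for j in best]
    center_weights = [0] * k
    for x, w in zip(points, weights):   # weight of a center = total weight assigned to it
        nearest = min(range(k), key=lambda c: dist(x, centers[c]))
        center_weights[nearest] += w
    return centers, center_weights

def small_space(points, l, k, dist):
    """Steps 1 to 4: split into l parts, cluster each, then cluster the weighted centers."""
    size = -(-len(points) // l)         # ceiling division
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    mids, mid_weights = [], []
    for chunk in chunks:                # step 2: unit weights inside each subset
        centers, weights = wkm(chunk, [1] * len(chunk), k, dist)
        mids += centers
        mid_weights += weights
    return wkm(mids, mid_weights, k, dist)   # step 4: cluster the weighted centers

points = [0, 1, 2, 100, 101, 102, 1, 2, 3, 101, 102, 103]
centers, weights = small_space(points, l=2, k=2, dist=lambda a, b: abs(a - b))
print(centers, weights)
```

At any time only one subset or the set of intermediate centers has to be held in memory, which is the point of the construction.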
More informationExploratory data analysis approaches unsupervised approaches. Steven Kiddle With thanks to Richard Dobson and Emanuele de Rinaldis
Exploratory data analysis approaches unsupervised approaches Steven Kiddle With thanks to Richard Dobson and Emanuele de Rinaldis Lecture overview Page 1 Ø Background Ø Revision Ø Other clustering methods
More informationPartitioning and Divide and Conquer Strategies
and Divide and Conquer Strategies Lecture 4 and Strategies Strategies Data partitioning aka domain decomposition Functional decomposition Lecture 4 and Strategies Quiz 4.1 For nuclear reactor simulation,
More informationCLUSTERING FOR FORENSIC ANALYSIS
IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 23218843; ISSN(P): 23474599 Vol. 2, Issue 4, Apr 2014, 129136 Impact Journals CLUSTERING FOR FORENSIC ANALYSIS
More informationThe New NCCI Hazard Groups
The New NCCI Hazard Groups Greg Engl, PhD, FCAS, MAAA National Council on Compensation Insurance CAS Reinsurance Seminar June, 2006 Workers Compensation Session Agenda History of previous work Impact of
More informationBNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I
BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential
More informationReversing Statistics for Scalable Test Databases Generation
Reversing Statistics for Scalable Test Databases Generation Entong Shen Lyublena Antova Pivotal (formerly Greenplum) DBTest 2013, New York, June 24 1 Motivation Data generators: functional and performance
More informationData Mining Project Report. Document Clustering. Meryem UzunPer
Data Mining Project Report Document Clustering Meryem UzunPer 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. Kmeans algorithm...
More information