Big Data & Scripting Part II Streaming Algorithms
- Grant Morgan
- 10 years ago
a note on sampling and filtering
- sampling: (randomly) choose a representative subset
- filtering: given some criterion (e.g. membership in a set), retain only the elements matching that criterion
- example scenario: stream of requests (user, request)
  - sampling requests is straightforward (e.g. to find which pages are accessed most frequently)
  - analyzing the distribution of frequencies is more complicated: we want to know how many queries are repeated x times (for all x)
sampling and filtering example
- n = 200,000 events, m = 40,000 different requests, uniform distribution
- [Figure: two panels, "all queries" and "% sample", plotted over id]
sampling and filtering example
- same dataset, but plotted as frequency vs. number of queries with this frequency
- [Figure: two panels, "all queries by frequency" and "% sample by frequency", each showing the number of queries with that frequency]
- completely different distributions due to sampling
sampling and filtering example
- same dataset, again plotted as frequency vs. number of queries with this frequency
- this time the sample is selected by a fixed subset of ids
- [Figure: two panels, "all queries by frequency" and "corrected 10% sample by frequency", each showing the number of queries with that frequency]
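The contrast between the two sampling strategies can be tried out on synthetic data. The following is a minimal sketch, not the setup used for the slides: the stream is a scaled-down, hypothetical version of the scenario (20,000 events over 4,000 integer ids), and the names `frequency_histogram`, `sample_events`, and `sample_by_id` are made up for illustration. Selecting ids by `id mod 100` stands in for hashing the id into buckets.

```python
import random
from collections import Counter


def frequency_histogram(events):
    """Map each frequency x to the number of distinct ids seen exactly x times."""
    per_id = Counter(events)
    return Counter(per_id.values())


def sample_events(events, rate, rng):
    """Keep every single event independently with probability `rate`."""
    return [e for e in events if rng.random() < rate]


def sample_by_id(events, percent):
    """Keep all events of a fixed subset of ids (here: id mod 100 < percent)."""
    return [e for e in events if e % 100 < percent]


rng = random.Random(0)
# Assumption: scaled-down uniform request stream, ids are plain integers.
events = [rng.randrange(4000) for _ in range(20000)]

full = frequency_histogram(events)
by_event = frequency_histogram(sample_events(events, 0.10, rng))
by_id = frequency_histogram(sample_by_id(events, 10))

for name, hist in [("all", full), ("event sample", by_event), ("id sample", by_id)]:
    print(name, sorted(hist.items()))
```

Sampling single events thins out every id's count, so the sampled frequencies typically collapse towards 1 or 2. Sampling by id keeps the exact count of every retained id, so the histogram keeps its shape and only the number of ids shrinks.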
Histograms and Frequency Skews
stream and histogram
- consider the following input:
- [Figure: objects/buckets plotted over time]
- as time (the stream) progresses, data points come in, e.g. users issue requests
- data points are distinguished by some id or bucket (from hashing)
- some are seen more often (e.g. 4), some less often (e.g. 1)
  - e.g. user 4 sends requests with high frequency, user 1 sends only one request
- this is highly valuable information for an analysis
stream and histogram
- [Figure: objects/buckets plotted over time]
- to analyze these distributions, histograms are helpful
- [Figure: histogram with one bar per object]
comparing histograms - different distributions
- an example of two different streams of observations
- [Figure: two histograms over objects]
- both have an equal number of data points (10,000) and distinct objects (60)
- but the objects have different probabilities of being observed
- sorting the objects by frequency makes the difference more obvious
- [Figure: the same two histograms with objects sorted by frequency]
the plan
- information about the distribution of observations is crucial for many applications
- knowing the complete, exact histogram would be helpful, but is often not possible due to the large number of distinct objects
- workaround: characterize the histogram without knowing the complete picture
  - characteristic properties are easier to determine
  - analogous to descriptions of distributions on $\mathbb{R}$
characterizing distributions
- $m_i$: frequency of object $i$
- number of distinct objects seen so far: $\sum_i (m_i)^0$
- total number of objects seen so far: $\sum_i (m_i)^1 = \sum_i m_i$
- generalization: $M_k = \sum_i (m_i)^k$, the $k$th moment
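For a stream small enough to count exactly, the moments follow directly from the definition. This is only a reference sketch for checking approximations; the function name `frequency_moment` and the sample stream are illustrative assumptions, and exact counting is precisely what becomes infeasible for large streams.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact k-th moment M_k = sum_i (m_i)^k, where m_i is the frequency of object i."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())


stream = "a b a c a b".split()  # frequencies: a=3, b=2, c=1
print(frequency_moment(stream, 0))  # number of distinct objects
print(frequency_moment(stream, 1))  # stream length
print(frequency_moment(stream, 2))  # second moment
```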
$M_2$, the second moment
- what we have so far:
  - $M_0$: Flajolet-Martin algorithm from the last lecture
  - $M_1$: counting
  - combination: average $M_1 / M_0$
- next: estimate $M_2 = \sum_i m_i^2$
$M_2$, the second moment
- [Figure: two histograms over objects, each annotated with its $M_2$]
- motivation:
  - $M_2$ describes the skewness of a distribution
  - smaller $M_2$ means a less skewed distribution
  - related to the Gini index (surprise index)
  - used to limit approximation errors and for query optimization in database systems
$M_2$ and $\mathrm{Var}(X)$
- the variance describes the distribution of values
- $M_2$ describes the distribution of their frequencies
- $M_2$ is comparable to the variance of the frequencies: $\mathrm{Var}(\{m_i\}) = \frac{1}{N} \sum_i (m_i - \mu(\{m_i\}))^2$
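The relation can be made explicit with the standard identity $\mathrm{Var} = E[m^2] - E[m]^2$. The sketch below assumes that the normalizer in the variance is the number of distinct objects, i.e. $M_0$; this is an interpretation, since elsewhere $N$ denotes the stream length.

```latex
\mathrm{Var}(\{m_i\})
  = \frac{1}{M_0} \sum_i m_i^2 - \left( \frac{1}{M_0} \sum_i m_i \right)^2
  = \frac{M_2}{M_0} - \left( \frac{M_1}{M_0} \right)^2
```

For example, a stream with $M_0 = 7$, $M_1 = 20$ and $M_2 = 64$ gives $64/7 - (20/7)^2 = 48/49 \approx 0.98$. For fixed $M_0$ and $M_1$, a larger $M_2$ therefore means a larger variance of the frequencies.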
$M_2$, the second moment: approximation
- storing and counting all distinct objects is impossible
- approximation by the Alon-Matias-Szegedy algorithm [1]
- algorithm:
  - $N$ observations in the stream
  - choose $k$ random positions $p_j \in \{1, \dots, N\}$
  - when reaching position $p_j$: store the object at that position and start counting occurrences of this object in $m_j$
  - estimate: $M_2 \approx \frac{N}{k} \sum_{j=1}^{k} (2 m_j - 1)$
[1] Alon, N.; Matias, Y.; Szegedy, M.: The space complexity of approximating the frequency moments, 1999
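$M_2$, the second moment: example
- stream: c e c f a e g f f b b c g b a a f d a e ($N = 20$)
- random positions: 3, 7, 14, 5
  - position 3: encounter c, counting results in 2
  - position 7: encounter g, counting results in 2
  - position 14: encounter b, counting results in 1
  - position 5: encounter a, counting results in 4
- estimate: $M_2 \approx \frac{20}{4} [2 \cdot (2 \cdot 2 - 1) + (2 \cdot 1 - 1) + (2 \cdot 4 - 1)] = 5 \cdot 14 = 70$
- true value: $M_2 = 4^2 + 3^2 + 3^2 + 1^2 + 3^2 + 4^2 + 2^2 = 64$ (frequencies of a, b, c, d, e, f, g)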
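The example can be replayed in a single pass over the stream. This is a minimal sketch under the assumption that the positions are given up front and are 1-based; the function name `ams_estimate` is hypothetical.

```python
def ams_estimate(stream, positions):
    """Single-pass AMS-style estimate of M_2 for fixed, 1-based sample positions."""
    chosen = set(positions)
    tracked = []  # one [object, count] pair per chosen position
    n = 0
    for n, obj in enumerate(stream, start=1):
        # Every running counter for this object sees one more occurrence.
        for entry in tracked:
            if entry[0] == obj:
                entry[1] += 1
        # Start a new counter when a chosen position is reached.
        if n in chosen:
            tracked.append([obj, 1])
    return n / len(tracked) * sum(2 * m - 1 for _, m in tracked)


stream = "c e c f a e g f f b b c g b a a f d a e".split()
print(ams_estimate(stream, [3, 7, 14, 5]))  # 70.0
```

The nested loop over `tracked` costs O(k) per element, which is acceptable for small k; a dictionary from object to its counters would avoid the scan.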
$M_2$, the second moment: summary
- the algorithm is simple to implement
- it needs to store only the $k$ counters
- it gets more precise with larger $k$; proof idea:
  - the expected value of each counter term $2 m_j - 1$ is $M_2 / N$, a fixed fraction of $M_2$
  - the average of the $k$ counter terms, scaled by $N$, approaches $M_2$
- problem: $N$ may not be known in the beginning
approximating $M_2$ with unknown stream length
- the stream may be of unknown length or unlimited
- still, each position must be chosen uniformly at random from $\{1, \dots, N\}$
- solution:
  - keep counts for $k$ objects, beginning with the first $k$
  - when the object at position $p + 1 > k$ arrives (after $p$ objects have been seen): choose it with probability $k / (p + 1)$
  - if it is chosen, drop one existing element (each with equal probability) and start a new counter for the new object
  - result: each position is chosen with equal probability
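The replacement rule is a form of reservoir sampling and can be combined with the counters in one pass. The following is a sketch under assumptions rather than a reference implementation: the name `ams_second_moment_stream` and the `rng` parameter are made up, and `n` in the code is the position of the arriving object (the slide's $p + 1$).

```python
import random


def ams_second_moment_stream(stream, k, rng=None):
    """Estimate M_2 in one pass without knowing the stream length in advance."""
    rng = rng or random.Random()
    tracked = []  # at most k [object, count] pairs
    n = 0
    for obj in stream:
        n += 1
        # Update all running counters that watch this object.
        for entry in tracked:
            if entry[0] == obj:
                entry[1] += 1
        if len(tracked) < k:
            # The first k positions are always kept.
            tracked.append([obj, 1])
        elif rng.random() < k / n:
            # Keep position n with probability k/n, evicting a uniformly chosen slot.
            tracked[rng.randrange(k)] = [obj, 1]
    if not tracked:
        return 0.0
    return n / len(tracked) * sum(2 * m - 1 for _, m in tracked)


stream = "c e c f a e g f f b b c g b a a f d a e".split()
print(ams_second_moment_stream(stream, k=4, rng=random.Random(1)))
```

With `k` equal to the stream length, every position is tracked and the estimate coincides with the exact $M_2$, which makes a convenient sanity check.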
clustering data streams
clustering data streams: the problem
- many formulations of the clustering problem are possible
  - wide range of applications, strong variance in preconditions and objective functions
- common ground:
  - objects are connected by a relation
  - identify groups of similar objects with respect to that relation
  - the problem is intractable (NP-hard)
- some basic questions:
  - what kind of relation (e.g. binary, distance, similarity)?
  - can objects have a mean value (continuous space)?
  - what is a good cluster (objective function)?
  - is there a possibility of overlapping clusters?
clustering data streams: STREAM
- in the following: a single example problem and a single algorithm
  - k-median on a data stream
  - in one pass
  - with guaranteed approximation quality
- algorithm: STREAM
- Guha, Mishra, Motwani, O'Callaghan: Clustering Data Streams, 2000
clustering data streams: the k-median problem
- input:
  - objects $X = \{x_i : i = 1, \dots, N\}$
  - distance $d : X \times X \to \mathbb{R}$
  - every $x_i$ is seen once, in arbitrary order ($i = 1, \dots, N$)
  - $k$: number of clusters to find
- objective:
  - identify $k$ elements $m_1, \dots, m_k \in X$ (cluster centers)
  - let $N(m_j) = \{x_i \in X : j = \arg\min_{l \in \{1, \dots, k\}} d(x_i, m_l)\}$, i.e. all $x_i$ for which $m_j$ is the nearest center
  - minimize $C(\{m_1, \dots, m_k\}) = \sum_{j=1}^{k} \sum_{x_i \in N(m_j)} d(x_i, m_j)$
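The objective can be evaluated directly for a given set of centers: since every object is charged to its nearest center, the double sum reduces to a minimum per object. A minimal sketch, with `k_median_cost` and the one-dimensional sample data as illustrative assumptions:

```python
def k_median_cost(points, centers, dist):
    """Sum over all points of the distance to the nearest center."""
    return sum(min(dist(x, m) for m in centers) for x in points)


def absolute_distance(a, b):
    """Example distance on the real line."""
    return abs(a - b)


points = [0, 1, 2, 10, 11, 12]
print(k_median_cost(points, [1, 11], absolute_distance))  # 4
print(k_median_cost(points, [0, 12], absolute_distance))  # 6
```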
clustering data streams: approximating k-median
- for small problem instances, k-median can be approximated within a fixed factor
- fixed parameter approximation: $C_{\text{approx}} \le a \cdot C_{\text{opt}}$ (the approximation is at most a factor $a$ worse than the optimal solution, for fixed $a$)
- this approximation is useful for approximating larger instances
- approximation (idea):
  - k-medians can be stated as an integer program $P_I$
  - this program can be relaxed to a linear program $P_L$
  - a solution of $P_L$ can be rounded to a solution of $P_I$
  - linear programs can be solved efficiently
clustering data streams: weighted k-medians
- extending k-medians with weights: k-medians with weighted samples
  - $w : X \to \mathbb{R}_{>0}$
  - the distance of objects to their centers is multiplied by the weight: $C(\{m_1, \dots, m_k\}) = \sum_{j=1}^{k} \sum_{x_i \in N(m_j)} w(x_i) \, d(x_i, m_j)$
  - k-medians is the special case with unit weights
- weighted k-medians can be approximated similarly to k-medians:
  - the algorithm can only be applied to small instances
  - use it to solve small sub-problems
- in the following, use the procedure wkm()
  - input: objects, weights, $k$
  - output: $k$ weighted centers
  - runtime: $O(n^2)$
first step - clustering with low memory
- approach: divide and conquer
- Small-Space(X):
  1. divide $X$ into $l$ disjoint subsets $X_1, \dots, X_l$
  2. cluster each $X_i$ individually into $k$ clusters
  3. result: $X'$, a set of $lk$ cluster centers
  4. cluster $X'$, using for each $c \in X'$ the size $|N(c)|$ as weight
- step 2 can be solved with a constant factor approximation: the solution is at most $b$ times worse than the optimum
- step 4 can be solved with a constant factor approximation: not worse than $c$ times the optimum
- result: a constant factor approximation, obtained from the partial solutions and their combination
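The four steps can be sketched in a few lines. Note the assumptions: the `wkm` below is a brute-force stand-in that tries every set of k candidate centers, so it is exact but only usable on tiny inputs, and it is not the LP-based approximation the slides refer to. The names `assign`, `wkm`, and `small_space` and the sample data are hypothetical.

```python
from itertools import combinations


def assign(points, centers, dist):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: dist(p, centers[j])) for p in points]


def wkm(points, weights, k, dist):
    """Stand-in for wkm(): exhaustive weighted k-median, tiny inputs only."""
    k = min(k, len(points))
    best = min(
        combinations(range(len(points)), k),
        key=lambda idx: sum(
            w * min(dist(p, points[j]) for j in idx) for p, w in zip(points, weights)
        ),
    )
    centers = [points[j] for j in best]
    # The weight of a center is the total weight of the points assigned to it.
    center_weights = [0.0] * k
    for w, j in zip(weights, assign(points, centers, dist)):
        center_weights[j] += w
    return centers, center_weights


def small_space(points, k, l, dist):
    """Steps 1-4 of Small-Space: cluster l chunks, then cluster the weighted centers."""
    chunk = max(1, -(-len(points) // l))  # ceiling division
    centers, weights = [], []
    for start in range(0, len(points), chunk):
        part = points[start:start + chunk]
        c, w = wkm(part, [1.0] * len(part), k, dist)
        centers += c
        weights += w
    final, _ = wkm(centers, weights, k, dist)
    return final


pts = [0, 100, 200, 1, 101, 201, 2, 102, 202, 3, 103, 203]
print(small_space(pts, k=3, l=2, dist=lambda a, b: abs(a - b)))
```

Only one chunk and the list of at most $lk$ weighted centers need to be held at any time, which is the point of the divide-and-conquer step.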
extending to a solution
- Small-Space(X):
  1. divide $X$ into $l$ disjoint subsets $X_1, \dots, X_l$
  2. cluster each $X_i$ individually into $O(k)$ clusters
  3. result: $X'$, a set of $O(lk)$ cluster centers
  4. cluster $X'$, using for each $c \in X'$ the size $|N(c)|$ as weight
- constant factor approximation
- needs to cluster $X_i$: memory problem 1, size of the subsets versus $l$
- needs to cluster $X'$: memory problem 2, clustering $O(lk)$ elements
