Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman



Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are close to each other, while members of different clusters are far.


Clustering in two dimensions looks easy. Clustering small amounts of data looks easy. And in most cases, looks are not deceiving.

Many applications involve not 2, but 10 or 10,000 dimensions. High-dimensional spaces look different: almost all pairs of points are at about the same distance.

Assume random points within a bounding box, e.g., values between 0 and 1 in each dimension. In 2 dimensions: a variety of distances between 0 and 1.41. In 10,000 dimensions, the distance between two random points in any one dimension is distributed as a triangle.

The law of large numbers applies. Actual distance between two random points is the sqrt of the sum of squares of essentially the same set of differences.
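The concentration of distances can be checked with a small simulation. The sketch below is only an illustration: the function name, the number of sampled pairs, and the use of 1,000 rather than 10,000 dimensions (to keep the run short) are assumptions, not part of the slides. Since the squared difference of two uniform values has mean 1/6, the distance in d dimensions should be close to sqrt(d/6).

```python
import math
import random
import statistics

def pairwise_distance_stats(dim, pairs=200, seed=0):
    """Return (mean, stdev) of the Euclidean distance between random pairs
    of points drawn uniformly from the unit cube [0, 1]^dim."""
    rng = random.Random(seed)
    dists = []
    for _ in range(pairs):
        sq = 0.0
        for _ in range(dim):
            diff = rng.random() - rng.random()  # triangular on [-1, 1]
            sq += diff * diff
        dists.append(math.sqrt(sq))
    return statistics.mean(dists), statistics.stdev(dists)

for dim in (2, 1000):
    mean, sd = pairwise_distance_stats(dim)
    print(f"d={dim}: mean distance={mean:.2f}, stdev/mean={sd / mean:.3f}")
```

In 2 dimensions the relative spread is large; in 1,000 dimensions nearly every pair is at about the same distance.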

Euclidean spaces have dimensions, and points have coordinates in each dimension. Distance between points is usually the square root of the sum of the squares of the distances in each dimension. Non-Euclidean spaces have a distance measure that satisfies the triangle inequality $d(x,y) \le d(x,z) + d(z,y)$, but points do not really have a position in the space. Examples: Jaccard and edit distances.

Represent a document by the set of words that appear in the document. Documents with similar sets of words may be about the same topic. Distance between two documents = Jaccard distance of their sets of words. Jaccard distance = 1 - Jaccard similarity.
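A minimal sketch of the Jaccard distance between two documents, each reduced to its set of words. The function name, the whitespace tokenization, and the convention that two empty sets have distance 0 are assumptions made for the example.

```python
def jaccard_distance(a, b):
    """1 minus |intersection| / |union| of the two collections, taken as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets are identical
    return 1.0 - len(a & b) / len(union)

doc1 = "the cat sat on the mat".split()
doc2 = "the cat lay on the rug".split()
# Shared words: the, cat, on (3); distinct words overall: 7.
print(jaccard_distance(doc1, doc2))
```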

Objects are sequences of {C,A,T,G}. Distance between sequences = edit distance = the minimum number of inserts and deletes needed to turn one into the other.
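The edit distance defined here allows only inserts and deletes (no substitutions). One way to compute it, sketched below with a hypothetical function name, is a row-by-row dynamic program over prefixes of the two sequences.

```python
def edit_distance(s, t):
    """Minimum number of single-character inserts and deletes turning s into t."""
    # prev[j] holds the distance between the current prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]  # turning s[:i] into the empty string takes i deletes
        for j, ct in enumerate(t, start=1):
            if cs == ct:
                cur.append(prev[j - 1])                    # characters match: no cost
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))   # delete cs or insert ct
        prev = cur
    return prev[-1]

# Delete the A, then append an A.
print(edit_distance("CATG", "CTGA"))
```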

Hierarchical (Agglomerative): Initially, each point is in a cluster by itself. Repeatedly combine the two nearest clusters into one. Point Assignment: Maintain a set of clusters. Place points into their nearest cluster.

Two important questions: 1. How do you determine the nearness of clusters? 2. How do you represent a cluster of more than one point?

Key problem: as you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest? Euclidean case: each cluster has a centroid = average of its points. Measure intercluster distances by distances of centroids.

[Figure: hierarchical clustering example. Data points (o): (0,0), (1,2), (2,1), (4,1), (5,0), (5,3). Centroids (x) of the merged clusters: (1.5,1.5), (1,1), (4.5,0.5), (4.7,1.3).]
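Centroid-based agglomerative clustering on the six data points (0,0), (1,2), (2,1), (4,1), (5,0), (5,3) can be sketched as follows. The function names and the choice to stop at a fixed number of clusters are assumptions for illustration; the quadratic search for the nearest pair is the naive approach, not an optimized one.

```python
import math

def centroid(points):
    """Coordinate-wise average of a list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def agglomerate(points, k):
    """Merge the two clusters with the nearest centroids until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

example_points = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
for cluster in agglomerate(example_points, 2):
    print(sorted(cluster), centroid(cluster))
```

Along the way the merged clusters have centroids (1.5,1.5), (4.5,0.5), (1,1), and approximately (4.7,1.3).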

Non-Euclidean case: the only locations we can talk about are the points themselves. I.e., there is no average of two points. Approach 1: clustroid = point closest to other points. Treat clustroid as if it were centroid, when computing intercluster distances.

Possible meanings of "closest": 1. Smallest maximum distance to the other points. 2. Smallest average distance to other points. 3. Smallest sum of squares of distances to other points. 4. Etc., etc.
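The three listed criteria for choosing a clustroid can be sketched with any distance function. In the example below the points are sets compared by Jaccard distance; the function names and the criterion labels ("max", "avg", "sum_sq") are made up for the illustration.

```python
def jaccard_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def clustroid(points, dist, criterion="sum_sq"):
    """Return the point of the cluster that minimizes the chosen criterion."""
    if len(points) == 1:
        return points[0]

    def score(i):
        ds = [dist(points[i], q) for j, q in enumerate(points) if j != i]
        if criterion == "max":
            return max(ds)                 # smallest maximum distance
        if criterion == "avg":
            return sum(ds) / len(ds)       # smallest average distance
        if criterion == "sum_sq":
            return sum(d * d for d in ds)  # smallest sum of squared distances
        raise ValueError(f"unknown criterion: {criterion}")

    return points[min(range(len(points)), key=score)]

cluster = [frozenset("abc"), frozenset("abd"), frozenset("abcd"), frozenset("cde")]
for crit in ("max", "avg", "sum_sq"):
    print(crit, sorted(clustroid(cluster, jaccard_distance, crit)))
```

On this small cluster all three criteria select {a, b, c, d}; on other data they can disagree.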

[Figure: two clusters of numbered points (1 to 6), each with its clustroid marked; the intercluster distance is shown between the two clustroids.]

Approach 2: intercluster distance = minimum of the distances between any two points, one from each cluster. Approach 3: Pick a notion of cohesion of clusters, e.g., maximum distance from the clustroid. Merge clusters whose union is most cohesive.

Measuring cohesion. Approach 1: Use the diameter of the merged cluster = maximum distance between points in the cluster. Approach 2: Use the average distance between points in the cluster. Approach 3: Density-based approach: take the diameter or average distance, e.g., and divide by the number of points in the cluster. Perhaps raise the number of points to a power first, e.g., square-root.
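The three cohesion measures can be written down directly. This is a sketch under assumptions: the function names are hypothetical, and the density score here divides the diameter (rather than the average distance) by the number of points raised to a power. Lower scores mean a more cohesive cluster.

```python
import math
from itertools import combinations

def diameter(points, dist):
    """Maximum distance between any two points of the cluster."""
    return max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)

def average_distance(points, dist):
    """Average distance over all pairs of points of the cluster."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs) if pairs else 0.0

def density_score(points, dist, power=1.0):
    """Diameter divided by (number of points ** power)."""
    return diameter(points, dist) / len(points) ** power

candidate = [(4, 1), (5, 0), (5, 3)]  # union of two clusters being considered
print(diameter(candidate, math.dist))
print(average_distance(candidate, math.dist))
print(density_score(candidate, math.dist, power=0.5))
```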

k-means: assumes Euclidean space. Start by picking k, the number of clusters. Initialize clusters with one point per cluster. Example: pick one point at random, then k-1 other points, each as far away as possible from the previous points. OK, as long as there are no outliers (points that are far from any reasonable cluster). Example: use a sample of points, cluster them by any means, and use one point per sample cluster.
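The first initialization example can be sketched as below. "As far away as possible from the previous points" is interpreted here as maximizing the distance to the nearest already-chosen point; that reading, the function name, and the seed parameter are assumptions.

```python
import math
import random

def farthest_first_centroids(points, k, seed=0):
    """Pick one point at random, then k-1 more, each maximizing its
    distance to the nearest point chosen so far."""
    rng = random.Random(seed)
    chosen = [rng.choice(points)]
    while len(chosen) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(nxt)
    return chosen

sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(farthest_first_centroids(sample, 2))
```

As the slide warns, an outlier would be chosen early by this rule, which is why the sample-based alternative exists.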

1. For each point, place it in the cluster whose current centroid it is nearest. 2. After all points are assigned, fix the centroids of the k clusters. 3. Optional: reassign all points to their closest centroid. Sometimes moves points between clusters.
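The assignment and centroid-update steps can be sketched as follows. Repeating the two steps until the centroids stop moving is the common iterative formulation and is an assumption here, as are the function names and the `max_rounds` cap; the slide describes a single round with an optional reassignment.

```python
import math

def assign(points, centroids):
    """Place each point in the cluster whose centroid is nearest."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def recompute(clusters, old_centroids):
    """Centroid of each cluster; an empty cluster keeps its old centroid."""
    new = []
    for cluster, old in zip(clusters, old_centroids):
        if cluster:
            n = len(cluster)
            new.append(tuple(sum(p[i] for p in cluster) / n for i in range(len(old))))
        else:
            new.append(old)
    return new

def k_means(points, centroids, max_rounds=100):
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_rounds):
        clusters = assign(points, centroids)
        updated = recompute(clusters, centroids)
        if updated == centroids:  # no centroid moved: assignments are stable
            break
        centroids = updated
    return centroids, clusters

data = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
final_centroids, final_clusters = k_means(data, [(0, 0), (5, 3)])
print(final_centroids)
print(final_clusters)
```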

[Figure: points 1 to 8 and the clusters after the first round, with the reassigned points marked.]

Try different k, looking at the change in the average distance to centroid, as k increases. Average falls rapidly until right k, then changes little. [Figure: average distance to centroid plotted against k, with the best value of k marked.]

Too few; many long distances to centroid.

Just right; distances rather short.

Too many; little improvement in average distance.

BFR (Bradley-Fayyad-Reina) is a variant of k-means designed to handle very large (disk-resident) data sets. It assumes that clusters are normally distributed around a centroid in a Euclidean space. Standard deviations in different dimensions may vary.

Points are read one main-memory-full at a time. Most points from previous memory loads are summarized by simple statistics. To begin, from the initial load we select the initial k centroids by some sensible approach.

BFR keeps track of three classes of points: 1. The discard set (DS): points close enough to a centroid to be summarized. 2. The compression set (CS): groups of points that are close together but not close to any centroid. They are summarized, but not assigned to a cluster. 3. The retained set (RS): isolated points.

The discard set and each compression set are each summarized by: 1. The number of points, N. 2. The vector SUM, whose i-th component is the sum of the coordinates of the points in the i-th dimension. 3. The vector SUMSQ: i-th component = sum of squares of coordinates in the i-th dimension.

$2d + 1$ values represent any number of points; $d$ = number of dimensions. Averages in each dimension (centroid coordinates) can be calculated easily as $\mathrm{SUM}_i/N$, where $\mathrm{SUM}_i$ = $i$-th component of SUM. Variance in dimension $i$ can be computed by $(\mathrm{SUMSQ}_i/N) - (\mathrm{SUM}_i/N)^2$, and the standard deviation is the square root of that.
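A summary of this kind takes only a few lines. The class and method names below are hypothetical; the arithmetic follows the N, SUM, SUMSQ formulas stated above.

```python
import math

class ClusterSummary:
    """Keeps N, SUM and SUMSQ: 2d + 1 numbers for any number of points."""

    def __init__(self, dims):
        self.n = 0
        self.sum = [0.0] * dims
        self.sumsq = [0.0] * dims

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # SUMSQ_i / N - (SUM_i / N)^2 in each dimension i
        return [sq / self.n - (s / self.n) ** 2 for s, sq in zip(self.sum, self.sumsq)]

    def stdev(self):
        # max(...) guards against tiny negative values from rounding
        return [math.sqrt(max(v, 0.0)) for v in self.variance()]

summary = ClusterSummary(dims=2)
for p in [(1, 2), (3, 2), (5, 8)]:
    summary.add(p)
print(summary.n, summary.sum, summary.sumsq)
print(summary.centroid(), summary.variance())
```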

[Figure: a cluster and its centroid, with the cluster's points in the DS; compression sets, whose points are in the CS; and isolated points in the RS.]

For each memory load of points: 1. Find those points that are sufficiently close to a cluster centroid; add those points to that cluster and the DS. 2. Use any main-memory clustering algorithm to cluster the remaining points and the old RS. Clusters go to the CS; outlying points to the RS.

3. Adjust statistics of the clusters to account for the new points. Consider merging compressed sets in the CS. 4. If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster.

How do we decide if a point is close enough to a cluster that we will add the point to that cluster? How do we decide whether two compressed sets deserve to be combined into one?

We need a way to decide whether to put a new point into a cluster. BFR suggest two ways: 1. The Mahalanobis distance is less than a threshold. 2. Low likelihood of the currently nearest centroid changing.

Mahalanobis distance: normalized Euclidean distance from centroid. For point $(x_1, \ldots, x_d)$ and centroid $(c_1, \ldots, c_d)$: 1. Normalize in each dimension: $y_i = (x_i - c_i)/\sigma_i$, where $\sigma_i$ = standard deviation in the $i$-th dimension. 2. Take the sum of the squares of the $y_i$. 3. Take the square root.

If clusters are normally distributed in $d$ dimensions, then after transformation, one standard deviation = $\sqrt{d}$. I.e., 70% of the points of the cluster will have a Mahalanobis distance < $\sqrt{d}$. Accept a point for a cluster if its M.D. is < some threshold, e.g., 4 standard deviations.
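The acceptance test can be computed from a cluster's N, SUM and SUMSQ alone. In this sketch the function names are hypothetical, "4 standard deviations" is read as 4 times the square root of the number of dimensions, and every dimension is assumed to have nonzero variance.

```python
import math

def mahalanobis(point, n, sums, sumsqs):
    """Normalized Euclidean distance of point from the cluster centroid."""
    total = 0.0
    for x, s, sq in zip(point, sums, sumsqs):
        mean = s / n
        var = sq / n - mean * mean  # assumed > 0 in every dimension
        total += (x - mean) ** 2 / var
    return math.sqrt(total)

def accept(point, n, sums, sumsqs, threshold_sigmas=4.0):
    """Accept if the M.D. is below threshold_sigmas * sqrt(d) (assumed reading)."""
    d = len(point)
    return mahalanobis(point, n, sums, sumsqs) < threshold_sigmas * math.sqrt(d)

# Summary of the points (1,2), (3,2), (5,8): centroid (3,4), variances (8/3, 8).
N, SUM, SUMSQ = 3, (9.0, 12.0), (35.0, 72.0)
print(mahalanobis((5, 4), N, SUM, SUMSQ))
print(accept((5, 4), N, SUM, SUMSQ), accept((100, 4), N, SUM, SUMSQ))
```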


Compute the variance of the combined subcluster. N, SUM, and SUMSQ allow us to make that calculation quickly. Combine if the variance is below some threshold. Many alternatives: treat dimensions differently, consider density.
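Because N, SUM and SUMSQ are all sums, the summary of a combined subcluster is the componentwise sum of the two summaries. The sketch below uses plain tuples (N, SUM, SUMSQ); the function names and the rule "variance below the threshold in every dimension" are assumptions, one of the many alternatives the slide alludes to.

```python
def merge_summaries(a, b):
    """Summary of the union of two subclusters given as (N, SUM, SUMSQ)."""
    n = a[0] + b[0]
    sums = [x + y for x, y in zip(a[1], b[1])]
    sumsqs = [x + y for x, y in zip(a[2], b[2])]
    return n, sums, sumsqs

def variances(summary):
    n, sums, sumsqs = summary
    return [sq / n - (s / n) ** 2 for s, sq in zip(sums, sumsqs)]

def should_combine(a, b, threshold):
    # Assumed rule: every dimension's variance must be below the threshold.
    return all(v < threshold for v in variances(merge_summaries(a, b)))

set_a = (2, [2.0, 0.0], [4.0, 0.0])        # points (0,0), (2,0)
set_b = (2, [2.0, 0.0], [2.0, 2.0])        # points (1,1), (1,-1)
set_c = (2, [22.0, 20.0], [244.0, 200.0])  # points (10,10), (12,10)
print(variances(merge_summaries(set_a, set_b)))
print(should_combine(set_a, set_b, 1.0), should_combine(set_a, set_c, 1.0))
```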

Problem with BFR/k-means: Assumes clusters are normally distributed in each dimension. And axes are fixed; ellipses at an angle are not OK. CURE (Clustering Using REpresentatives): Assumes a Euclidean distance. Allows clusters to assume any shape.

[Figure: scatter plot with axes salary and age.]

1. Pick a random sample of points that fit in main memory. 2. Cluster these points hierarchically: group nearest points/clusters. 3. For each cluster, pick a sample of points, as dispersed as possible. 4. From the sample, pick representatives by moving them (say) 20% toward the centroid of the cluster.

[Figure: scatter plot with axes salary and age.]

Pick (say) 4 remote points for each cluster. [Figure: scatter plot with axes salary and age.]

Move points (say) 20% toward the centroid. [Figure: scatter plot with axes salary and age.]

Now, visit each point p in the data set. Place it in the closest cluster. Normal definition of "closest": that cluster with the closest (to p) among all the sample points of all the clusters.
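The representative-point steps of CURE and the final assignment pass can be sketched as follows. The greedy way of picking dispersed points (start with the point farthest from the centroid, then repeatedly take the point farthest from those already chosen) is one possible choice and an assumption here, as are the function names; the clusters are assumed to come from the hierarchical clustering of the sample.

```python
import math

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def pick_dispersed(cluster, count):
    """Greedily choose up to `count` points of the cluster that are far apart."""
    c = centroid(cluster)
    chosen = [max(cluster, key=lambda p: math.dist(p, c))]
    while len(chosen) < min(count, len(cluster)):
        chosen.append(max(cluster, key=lambda p: min(math.dist(p, q) for q in chosen)))
    return chosen

def representatives(cluster, count=4, shrink=0.2):
    """Move each dispersed point the fraction `shrink` toward the centroid."""
    c = centroid(cluster)
    return [tuple(x + shrink * (cx - x) for x, cx in zip(p, c))
            for p in pick_dispersed(cluster, count)]

def assign(p, reps_by_cluster):
    """Index of the cluster that owns the representative closest to p."""
    return min(range(len(reps_by_cluster)),
               key=lambda i: min(math.dist(p, r) for r in reps_by_cluster[i]))

cluster_a = [(0, 0), (0, 2), (2, 0), (2, 2), (1, 1)]
cluster_b = [(10, 10), (10, 12), (12, 10), (12, 12), (11, 11)]
reps = [representatives(cluster_a), representatives(cluster_b)]
print(reps[0])
print(assign((3, 3), reps), assign((9, 9), reps))
```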