Cloud and Big Data Summer School, Stockholm, Aug. 2015. Jeffrey D. Ullman
Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are close to each other, while members of different clusters are far apart.
Clustering in two dimensions looks easy. Clustering small amounts of data looks easy. And in most cases, looks are not deceiving.
Many applications involve not 2, but 10 or 10,000 dimensions. High-dimensional spaces look different: almost all pairs of points are at about the same distance.
Assume random points within a bounding box, e.g., values between 0 and 1 in each dimension. In 2 dimensions: a variety of distances between 0 and 1.41. In 10,000 dimensions, the distance between two random points in any one dimension has a triangular distribution.
The law of large numbers applies. The actual distance between two random points is the square root of the sum of squares of essentially the same set of differences.
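The effect can be checked numerically. The following is a minimal sketch, assuming uniform random points in the unit cube; the function name `pairwise_distance_spread`, the sample size of 20 points, and the choice of 1,000 dimensions (instead of 10,000, to keep the run short) are illustrative and not from the slides.

```python
import math
import random

def pairwise_distance_spread(dims, n_points=20, seed=0):
    """Return (min, max) Euclidean distance over all pairs of random
    points drawn uniformly from the unit cube in `dims` dimensions."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dims)] for _ in range(n_points)]
    dists = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    return min(dists), max(dists)

for dims in (2, 1000):
    lo, hi = pairwise_distance_spread(dims)
    print(f"{dims:>5} dims: min={lo:.2f} max={hi:.2f} max/min={hi / lo:.2f}")
```

In 2 dimensions the ratio of the largest to the smallest distance is large; in 1,000 dimensions it should come out close to 1, which is the sense in which almost all pairs are at about the same distance.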
Euclidean spaces have dimensions, and points have coordinates in each dimension. Distance between points is usually the square root of the sum of the squares of the distances in each dimension. Non-Euclidean spaces have a distance measure that satisfies the triangle inequality d(x,y) ≤ d(x,z) + d(z,y), but points do not really have a position in the space. Examples: Jaccard and edit distances.
Represent a document by the set of words that appear in the document. Documents with similar sets of words may be about the same topic. Distance between two documents = Jaccard distance of their sets of words. Jaccard distance = 1 − Jaccard similarity.
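As a small illustration, Jaccard similarity is the size of the intersection divided by the size of the union, so the distance can be sketched as below. The function name and the convention of returning 0 for two empty sets are assumptions made for this example.

```python
def jaccard_distance(a, b):
    """Jaccard distance = 1 - |a ∩ b| / |a ∪ b| for two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty documents are treated as identical
    return 1.0 - len(a & b) / len(union)

doc1 = "the quick brown fox".split()
doc2 = "the lazy brown dog".split()
# The documents share 2 words out of 6 distinct words overall.
print(jaccard_distance(doc1, doc2))
```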
Objects are sequences of {C,A,T,G}. Distance between sequences = edit distance = the minimum number of inserts and deletes needed to turn one into the other.
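With only inserts and deletes allowed, the edit distance equals the two lengths added together minus twice the length of a longest common subsequence (LCS). A minimal sketch under that observation follows; the function name is illustrative.

```python
def edit_distance(x, y):
    """Minimum number of single-character inserts and deletes turning x into y."""
    # lcs[i][j] = length of the longest common subsequence of x[:i] and y[:j]
    lcs = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, cx in enumerate(x, start=1):
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    # Everything outside the LCS is deleted from x or inserted from y.
    return len(x) + len(y) - 2 * lcs[len(x)][len(y)]

# CATG -> (delete A) CTG -> (insert A) CTGA
print(edit_distance("CATG", "CTGA"))
```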
Hierarchical (agglomerative): initially, each point is in a cluster by itself. Repeatedly combine the two nearest clusters into one.
Point assignment: maintain a set of clusters. Place points into their nearest cluster.
Two important questions:
1. How do you determine the "nearness" of clusters?
2. How do you represent a cluster of more than one point?
Key problem: as you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest? Euclidean case: each cluster has a centroid = average of its points. Measure intercluster distances by distances of centroids.
[Figure: hierarchical clustering example. Data points (o): (0,0), (1,2), (2,1), (4,1), (5,0), (5,3). Successive centroids (x): (1.5,1.5), (4.5,0.5), (1,1), (4.7,1.3).]
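The Euclidean procedure can be sketched on the six data points of the example: (0,0), (1,2), (2,1), (4,1), (5,0), (5,3). This is a simple quadratic-per-merge sketch, not an efficient implementation; the function names and the choice to stop at a given number of clusters `k` are assumptions for illustration.

```python
import math

def centroid(points):
    """Coordinate-wise average of a non-empty list of points."""
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dims))

def agglomerate(points, k):
    """Repeatedly merge the two clusters with the nearest centroids
    until only k clusters remain."""
    clusters = [[p] for p in points]  # each point starts in its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
for cluster in agglomerate(points, 2):
    print(sorted(cluster), centroid(cluster))
```

The intermediate centroids produced along the way are (1.5,1.5), (4.5,0.5), (1,1), and approximately (4.7,1.3), matching the values that appear on the slide.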
Non-Euclidean case: the only "locations" we can talk about are the points themselves. I.e., there is no "average" of two points. Approach 1: clustroid = point "closest" to other points. Treat the clustroid as if it were a centroid when computing intercluster distances.
Possible meanings of "closest":
1. Smallest maximum distance to the other points.
2. Smallest average distance to other points.
3. Smallest sum of squares of distances to other points.
4. Etc., etc.
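The first three meanings can be compared directly. The sketch below takes any distance function, so it does not rely on coordinates; the function name `clustroid`, the `criterion` parameter, and its string values are hypothetical names chosen for this example.

```python
def clustroid(points, dist, criterion="sum_sq"):
    """Pick the point of the cluster that is 'closest' to the others
    under one of three criteria: 'max', 'avg', or 'sum_sq'."""
    if len(points) == 1:
        return points[0]

    def cost(i):
        ds = [dist(points[i], q) for j, q in enumerate(points) if j != i]
        if criterion == "max":
            return max(ds)
        if criterion == "avg":
            return sum(ds) / len(ds)
        if criterion == "sum_sq":
            return sum(d * d for d in ds)
        raise ValueError(f"unknown criterion: {criterion}")

    best = min(range(len(points)), key=cost)
    return points[best]

# Stand-in distance on numbers; any distance measure (Jaccard, edit) works.
values = [0, 1, 2, 3, 10]
gap = lambda a, b: abs(a - b)
for criterion in ("max", "avg", "sum_sq"):
    print(criterion, clustroid(values, gap, criterion))
```

On this data the criteria disagree: the outlier at 10 pulls the "max" and "sum of squares" choices to 3, while the average-distance choice is 2.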
[Figure: two clusters of points, each with its clustroid marked; the intercluster distance is measured between the two clustroids.]
Approach 2: intercluster distance = minimum of the distances between any two points, one from each cluster. Approach 3: pick a notion of "cohesion" of clusters, e.g., maximum distance from the clustroid. Merge clusters whose union is most cohesive.
Approach 1: use the diameter of the merged cluster = maximum distance between points in the cluster. Approach 2: use the average distance between points in the cluster. Approach 3: density-based approach: take the diameter or average distance, e.g., and divide by the number of points in the cluster. Perhaps raise the number of points to a power first, e.g., square root.
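Read as scores for a candidate merged cluster, the three approaches might be sketched as follows. The function names are illustrative, and using the diameter (rather than the average distance) as the numerator of the density-based score is one of the two options the text allows.

```python
import math
from itertools import combinations

def diameter(points, dist=math.dist):
    """Maximum distance between any two points of the cluster."""
    return max(dist(p, q) for p, q in combinations(points, 2))

def average_distance(points, dist=math.dist):
    """Average distance over all pairs of points in the cluster."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def density_score(points, power=1.0, dist=math.dist):
    """Diameter divided by (number of points ** power); lower means denser."""
    return diameter(points, dist) / len(points) ** power

merged = [(0, 0), (3, 4), (6, 8)]
print(diameter(merged))                   # largest pairwise distance
print(average_distance(merged))           # mean pairwise distance
print(density_score(merged))              # divide by the number of points
print(density_score(merged, power=0.5))   # divide by its square root instead
```

All three functions need at least two points in the cluster.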
k-means assumes a Euclidean space. Start by picking k, the number of clusters. Initialize clusters with one point per cluster. Example: pick one point at random, then k-1 other points, each as far away as possible from the previous points. OK, as long as there are no outliers (points that are far from any reasonable cluster). Example: use a sample of points, cluster them by any means, and use one point per sample cluster.
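The first initialization example can be sketched as below, reading "as far away as possible from the previous points" as maximizing the distance to the nearest already-chosen point. That reading, the function name, and the `first` parameter (an index used here in place of a random pick so the result is reproducible) are assumptions.

```python
import math

def farthest_first(points, k, first=0):
    """Choose k initial points: start from points[first], then repeatedly
    add the point whose nearest chosen point is farthest away."""
    chosen = [points[first]]
    while len(chosen) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(nxt)
    return chosen

points = [(0, 0), (1, 0), (10, 0), (11, 0), (5, 0)]
print(farthest_first(points, 3))
```

An outlier would be picked early by this rule, which is why the text warns that the method is only OK when there are no outliers.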
1. For each point, place it in the cluster whose current centroid it is nearest.
2. After all points are assigned, fix the centroids of the k clusters.
3. Optional: reassign all points to their closest centroid. Sometimes moves points between clusters.
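A minimal sketch of these steps follows. It repeats the assign-and-recompute round until the assignment stops changing, which is a common variant rather than something the steps above require; the function names and the `max_rounds` cap are illustrative.

```python
import math

def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dims))

def nearest(p, centroids):
    """Index of the centroid nearest to point p."""
    return min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))

def kmeans(points, centroids, max_rounds=100):
    """Return (assignment, centroids) after repeated assignment rounds."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_rounds):
        new_assignment = [nearest(p, centroids) for p in points]  # step 1
        if new_assignment == assignment:
            break  # no point moved, so the centroids are stable
        assignment = new_assignment
        for c in range(len(centroids)):                           # step 2
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return assignment, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = kmeans(points, [(0, 0), (10, 10)])
print(assignment, centroids)
```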
[Figure: clusters after the first round, with points numbered 1 to 8 and the reassigned points marked.]
Try different k, looking at the change in the average distance to centroid, as k increases. Average falls rapidly until the right k, then changes little.
[Figure: average distance to centroid plotted against k, with the best value of k marked.]
Too few; many long distances to centroid.
Just right; distances rather short.
Too many; little improvement in average distance.
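The search for the right k can be sketched on synthetic data with three well-separated groups. Everything here is an assumption made for illustration: the data, the helper names, the farthest-point initialization, and the fixed number of rounds.

```python
import math

def mean(points):
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dims))

def farthest_first(points, k):
    """Deterministic initialization: start at points[0], then add the
    point farthest from its nearest already-chosen point."""
    chosen = [points[0]]
    while len(chosen) < k:
        chosen.append(max(points, key=lambda p: min(math.dist(p, c) for c in chosen)))
    return chosen

def kmeans(points, k, rounds=20):
    centroids = farthest_first(points, k)
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[idx].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids

def average_distance_to_centroid(points, centroids):
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)

def square(cx, cy, h=0.5):
    """Four corner points of a small square centered at (cx, cy)."""
    return [(cx - h, cy - h), (cx - h, cy + h), (cx + h, cy - h), (cx + h, cy + h)]

points = square(0, 0) + square(10, 0) + square(0, 10)  # three true clusters
curve = {k: average_distance_to_centroid(points, kmeans(points, k)) for k in range(1, 5)}
for k, avg in curve.items():
    print(k, round(avg, 3))
```

The average falls sharply up to k = 3 and improves only slightly at k = 4, which is the bend the text describes.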
BFR (Bradley-Fayyad-Reina) is a variant of k-means designed to handle very large (disk-resident) data sets. It assumes that clusters are normally distributed around a centroid in a Euclidean space. Standard deviations in different dimensions may vary.
Points are read one main-memory-full at a time. Most points from previous memory loads are summarized by simple statistics. To begin, from the initial load we select the initial k centroids by some sensible approach.
1. The discard set (DS): points close enough to a centroid to be summarized.
2. The compression set (CS): groups of points that are close together but not close to any centroid. They are summarized, but not assigned to a cluster.
3. The retained set (RS): isolated points.
The discard set and each compression set are summarized by:
1. The number of points, N.
2. The vector SUM, whose i-th component is the sum of the coordinates of the points in the i-th dimension.
3. The vector SUMSQ: i-th component = sum of squares of coordinates in the i-th dimension.
2d + 1 values represent any number of points, where d = number of dimensions. Averages in each dimension (centroid coordinates) can be calculated easily as $\mathrm{SUM}_i/N$, where $\mathrm{SUM}_i$ = i-th component of SUM. Variance in dimension i can be computed by: $(\mathrm{SUMSQ}_i/N) - (\mathrm{SUM}_i/N)^2$. And the standard deviation is the square root of that.
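The summary and the formulas above can be sketched as a small class. The class and method names are hypothetical; the clamp to zero before the square root is an added guard against small negative values from floating-point rounding.

```python
import math

class ClusterSummary:
    """Summarizes any number of d-dimensional points by N, SUM, and SUMSQ."""

    def __init__(self, dims):
        self.n = 0
        self.sum = [0.0] * dims
        self.sumsq = [0.0] * dims

    def add(self, point):
        """Fold one point into the summary; the point itself can then be discarded."""
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # (SUMSQ_i / N) - (SUM_i / N)^2 in each dimension
        return [sq / self.n - (s / self.n) ** 2 for s, sq in zip(self.sum, self.sumsq)]

    def std_dev(self):
        return [math.sqrt(max(v, 0.0)) for v in self.variance()]

summary = ClusterSummary(dims=2)
for p in [(1, 2), (3, 4), (5, 6)]:
    summary.add(p)
print(summary.n, summary.sum, summary.sumsq)
print(summary.centroid(), summary.variance(), summary.std_dev())
```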
[Figure: a cluster, shown with its centroid, whose points are in the DS; compression sets, whose points are in the CS; and isolated points in the RS.]
1. Find those points that are "sufficiently close" to a cluster centroid; add those points to that cluster and the DS.
2. Use any main-memory clustering algorithm to cluster the remaining points and the old RS. Clusters go to the CS; outlying points to the RS.
3. Adjust statistics of the clusters to account for the new points. Consider merging compressed sets in the CS.
4. If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster.
How do we decide if a point is "close enough" to a cluster that we will add the point to that cluster? How do we decide whether two compressed sets deserve to be combined into one?
We need a way to decide whether to put a new point into a cluster. BFR suggests two ways:
1. The Mahalanobis distance is less than a threshold.
2. Low likelihood of the currently nearest centroid changing.
Mahalanobis distance: normalized Euclidean distance from the centroid. For point $(x_1, \ldots, x_d)$ and centroid $(c_1, \ldots, c_d)$:
1. Normalize in each dimension: $y_i = (x_i - c_i)/\sigma_i$, where $\sigma_i$ = standard deviation in the i-th dimension.
2. Take the sum of the squares of the $y_i$.
3. Take the square root.
If clusters are normally distributed in d dimensions, then after transformation, one standard deviation = $\sqrt{d}$. I.e., about 70% of the points of the cluster will have a Mahalanobis distance < $\sqrt{d}$. Accept a point for a cluster if its M.D. is less than some threshold, e.g., 4 standard deviations.
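The distance and the acceptance test can be sketched as follows. The function names are illustrative, and reading "4 standard deviations" as a threshold of 4 times the square root of d is an interpretation of the text, not something it states explicitly. The sketch assumes every standard deviation is nonzero.

```python
import math

def mahalanobis(point, centroid, std_devs):
    """Normalized Euclidean distance: scale each coordinate difference by
    that dimension's standard deviation, then take the usual norm."""
    return math.sqrt(
        sum(((x - c) / s) ** 2 for x, c, s in zip(point, centroid, std_devs))
    )

def accept(point, centroid, std_devs, n_sigmas=4.0):
    """Accept the point if its distance is under n_sigmas * sqrt(d)
    (assumed reading of '4 standard deviations')."""
    threshold = n_sigmas * math.sqrt(len(point))
    return mahalanobis(point, centroid, std_devs) < threshold

centroid, std_devs = (1, 1), (2, 3)
print(mahalanobis((3, 4), centroid, std_devs))  # one std dev away in each dimension
print(accept((3, 4), centroid, std_devs))
print(accept((100, 1), centroid, std_devs))
```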
Compute the variance of the combined subcluster. N, SUM, and SUMSQ allow us to make that calculation quickly. Combine if the variance is below some threshold. Many alternatives: treat dimensions differently, consider density.
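Because N, SUM, and SUMSQ are all additive, the combined variance needs no access to the original points. In the sketch below each summary is a tuple `(N, SUM, SUMSQ)`; the function names and the rule that every dimension's variance must be under one shared threshold are assumptions, since the text leaves the exact test open.

```python
def merge_summaries(a, b):
    """Combine two (N, SUM, SUMSQ) summaries by component-wise addition."""
    n_a, sum_a, sumsq_a = a
    n_b, sum_b, sumsq_b = b
    return (
        n_a + n_b,
        [x + y for x, y in zip(sum_a, sum_b)],
        [x + y for x, y in zip(sumsq_a, sumsq_b)],
    )

def variances(summary):
    """Per-dimension variance: SUMSQ_i / N - (SUM_i / N)^2."""
    n, sums, sumsqs = summary
    return [sq / n - (s / n) ** 2 for s, sq in zip(sums, sumsqs)]

def should_merge(a, b, max_variance):
    """Assumed rule: merge only if every dimension's combined variance is small."""
    return all(v < max_variance for v in variances(merge_summaries(a, b)))

near_a = (2, [2, 0], [4, 0])        # summarizes points (0,0) and (2,0)
near_b = (2, [4, 0], [10, 0])       # summarizes points (1,0) and (3,0)
far_c = (2, [200, 0], [20002, 0])   # summarizes points (99,0) and (101,0)
print(variances(merge_summaries(near_a, near_b)))
print(should_merge(near_a, near_b, max_variance=2.0))
print(should_merge(near_a, far_c, max_variance=2.0))
```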
Problem with BFR/k-means: assumes clusters are normally distributed in each dimension. And axes are fixed: ellipses at an angle are not OK. CURE: assumes a Euclidean distance. Allows clusters to assume any shape.
[Figure: scatter plot of salary versus age.]
1. Pick a random sample of points that fit in main memory.
2. Cluster these points hierarchically: group nearest points/clusters.
3. For each cluster, pick a sample of points, as dispersed as possible.
4. From the sample, pick representatives by moving them (say) 20% toward the centroid of the cluster.
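Steps 3 and 4 can be sketched for a single cluster. The rule used to get dispersed points (start with the point farthest from the centroid, then repeatedly take the point farthest from those already chosen) is one reasonable choice and an assumption here, as are the function names and default parameters.

```python
import math

def centroid(points):
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dims))

def representatives(cluster, n_reps=4, shrink=0.2):
    """Pick up to n_reps dispersed points of the cluster, then move each
    one a fraction `shrink` of the way toward the cluster centroid."""
    center = centroid(cluster)
    # Assumed dispersal rule: farthest from the centroid first ...
    chosen = [max(cluster, key=lambda p: math.dist(p, center))]
    # ... then farthest from the points already chosen.
    while len(chosen) < min(n_reps, len(cluster)):
        chosen.append(
            max(cluster, key=lambda p: min(math.dist(p, c) for c in chosen))
        )
    return [tuple(x + shrink * (c - x) for x, c in zip(p, center)) for p in chosen]

cluster = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)]
print(representatives(cluster))
```

Here the four corners are selected and each moves 20% of the way toward the centroid (5,5).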
[Figure: salary versus age. Pick (say) 4 remote points for each cluster.]
[Figure: salary versus age. Move points (say) 20% toward the centroid.]
Now, visit each point p in the data set. Place it in the "closest" cluster. Normal definition of "closest": that cluster with the closest (to p) among all the sample points of all the clusters.
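This final pass might be sketched as follows, with the representative points of each cluster held in a dictionary. The cluster names, the (age, salary in thousands) coordinates, and the function name are made up for illustration.

```python
import math

def assign(p, reps_by_cluster):
    """Return the name of the cluster owning the representative nearest to p."""
    return min(
        reps_by_cluster,
        key=lambda name: min(math.dist(p, r) for r in reps_by_cluster[name]),
    )

# Hypothetical representatives as (age, salary in thousands).
reps_by_cluster = {
    "young": [(25, 30), (30, 40)],
    "senior": [(55, 90), (60, 110)],
}
print(assign((28, 35), reps_by_cluster))
print(assign((58, 100), reps_by_cluster))
```

Because each cluster is described by several scattered representatives rather than one centroid, a point can be matched to an elongated or irregularly shaped cluster.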