COSC 6397 Big Data Analytics
Fundamentals
Edgar Gabriel
Spring 2015

Overview
- Data Characteristics
- Performance Characteristics
- Platform Considerations

What makes large scale Data Analysis hard?
- Often summarized as VVVV
- Volume:
  - 5 Exabytes of data created until 2003
  - The same amount of data created in 2011 in two days
  - Estimate for 2013: 10 minutes for creating the same amount of data
  - Example: a communication service provider with 100 million customers generates ~5 petabytes of location data per day

From WWW to VVVV
- Velocity:
  - Throughput: amount of data moved through the pipes
    - Mobile data volumes growing at 78% per year
    - Expected to reach 10.8 exabytes per month in 2016
  - Latency:
    - Analytics used to be store and report
    - Data shown was from yesterday
    - Real-time analytics gaining popularity
    - Some services available which guarantee analysis in 10s
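
To put the volume and velocity figures above into perspective, the following sketch converts them into per-customer and per-second rates. It assumes decimal SI units (1 petabyte = 10^15 bytes) and a 30-day month; both are assumptions for illustration and are not stated on the slides.

```python
# Back-of-the-envelope arithmetic for the figures quoted on the slides.
# Assumptions (not from the slides): decimal SI units and a 30-day month.

PETABYTE = 10**15
EXABYTE = 10**18
SECONDS_PER_DAY = 24 * 60 * 60


def per_customer_bytes_per_day(total_bytes_per_day, customers):
    """Average number of bytes generated per customer per day."""
    return total_bytes_per_day / customers


def sustained_bytes_per_second(bytes_per_month, days_per_month=30):
    """Average transfer rate needed to move a monthly volume."""
    return bytes_per_month / (days_per_month * SECONDS_PER_DAY)


# ~5 petabytes of location data per day from 100 million customers
location_per_customer = per_customer_bytes_per_day(5 * PETABYTE, 100_000_000)

# 10.8 exabytes of mobile data per month
mobile_rate = sustained_bytes_per_second(10.8 * EXABYTE)

if __name__ == "__main__":
    print(f"{location_per_customer / 10**6:.0f} MB per customer per day")
    print(f"{mobile_rate / 10**12:.2f} TB/s sustained")
```

Under these assumptions the provider example works out to roughly 50 MB per customer per day, and the mobile data forecast to a sustained rate of a little over 4 TB/s.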

From WWW to VVVV
- Variety:
  - Data comes from a variety of sources in different formats
  - Example: call center which needs to integrate information from
    - Email
    - Trouble ticket
    - Conversation
    - Social media, blogs

From WWW to VVVV
- Veracity:
  - Data suffers from significant correctness and accuracy problems
  - Credibility: e.g. social media response to a campaign should not be based on third party likes
    - Likes can be purchased
    - Response by disgruntled employees
  - Audience Suitability
    - Customer service identifying a problem in a product has to share the information selectively

Analyzing large data volumes
- Large:
  - More data than can be processed on a single PC
  - Takes too long to be processed on a single PC
- The questions:
  - How to utilize multiple processors
  - How to evaluate whether we did a good job in using multiple processors
  - Administrative options for using multiple processors for large scale analysis

Performance Metrics (I)
- Speedup: how much faster does a problem run on p processors compared to 1 processor?
    \[ S(p) = \frac{T_{total}(1)}{T_{total}(p)} \]
  - Optimal: S(p) = p (linear speedup)
- Parallel Efficiency: speedup normalized by the number of processors
    \[ E(p) = \frac{S(p)}{p} \]
  - Optimal: E(p) = 1.0
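
The speedup and parallel efficiency definitions above translate directly into two small helper functions. This is a minimal sketch. The function names and the timings in the usage lines are made up for illustration and are not measurements from the lecture.

```python
def speedup(t_one, t_p):
    """S(p) = T_total(1) / T_total(p), both times in the same unit."""
    if t_one <= 0 or t_p <= 0:
        raise ValueError("execution times must be positive")
    return t_one / t_p


def efficiency(t_one, t_p, p):
    """E(p) = S(p) / p, i.e. speedup normalized by the processor count."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return speedup(t_one, t_p) / p


# Hypothetical timings: 120 s on 1 processor, 40 s on 4 processors.
s4 = speedup(120.0, 40.0)        # 3.0, below the linear optimum of 4
e4 = efficiency(120.0, 40.0, 4)  # 0.75, below the optimum of 1.0

if __name__ == "__main__":
    print(f"S(4) = {s4:.2f}, E(4) = {e4:.2f}")
```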
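
Performance Metrics (II)
- Example: Application A takes 35 min. on a single processor, 27 min. on two processors and 18 min. on 4 processors.
    \[ S(2) = \frac{35}{27} \approx 1.30, \qquad E(2) = \frac{S(2)}{2} \approx 0.648 \]
    \[ S(4) = \frac{35}{18} \approx 1.94, \qquad E(4) = \frac{S(4)}{4} \approx 0.486 \]

Amdahl's Law (I)
- Basic idea: most applications have a (small) sequential fraction, which limits the speedup
    \[ T_{total} = T_{sequential} + T_{parallel} = f\,T_{total} + (1 - f)\,T_{total} \]
  - f: fraction of the code which can only be executed sequentially
    \[ S(p) = \frac{T_{total}(1)}{\left(f + \frac{1 - f}{p}\right) T_{total}(1)} = \frac{1}{f + \frac{1 - f}{p}} \]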
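
Amdahl's Law as stated above can be evaluated with a few lines of code. The sketch below is illustrative: the function names are my own, and the value f = 0.05 in the usage lines is just an example input.

```python
def amdahl_speedup(f, p):
    """Speedup predicted by Amdahl's Law: 1 / (f + (1 - f) / p).

    f is the sequential fraction (0 <= f <= 1), p the processor count.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie between 0 and 1")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (f + (1.0 - f) / p)


def amdahl_limit(f):
    """Upper bound on the speedup for p -> infinity, which is 1 / f."""
    return float("inf") if f == 0 else 1.0 / f


if __name__ == "__main__":
    for p in (1, 2, 4, 8, 16, 32, 64):
        print(f"p = {p:2d}  S(p) = {amdahl_speedup(0.05, p):6.2f}")
    print(f"limit for f = 0.05: {amdahl_limit(0.05):.1f}")
```

Even a sequential fraction of 5% caps the speedup at 20, no matter how many processors are added.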

Example for Amdahl's Law
[Figure: speedup versus number of processors (1 to 64) for f = 0, f = 0.05, f = 0.1 and f = 0.2; vertical axis from 0 to 60]

Amdahl's Law (II)
- Amdahl's Law assumes that the problem size is constant
- In most applications, the sequential part is independent of the problem size, while the part which can be executed in parallel is not.
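
The effect of a growing problem size can be illustrated with a simple cost model. The sketch below assumes a fixed sequential time plus a perfectly parallel part that grows linearly with the problem size n. The model and its parameters (t_seq = 10 s, t_item = 1 s per work item) are hypothetical and chosen only to make the trend visible.

```python
def run_time(n, p, t_seq=10.0, t_item=1.0):
    """Assumed model: constant sequential part plus n work items split over p."""
    return t_seq + n * t_item / p


def sequential_fraction(n, t_seq=10.0, t_item=1.0):
    """Fraction f of the single-processor run time spent in the sequential part."""
    return t_seq / run_time(n, 1, t_seq, t_item)


def model_speedup(n, p, t_seq=10.0, t_item=1.0):
    """S(p) = T_total(1) / T_total(p) under the assumed model."""
    return run_time(n, 1, t_seq, t_item) / run_time(n, p, t_seq, t_item)


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        f = sequential_fraction(n)
        s = model_speedup(n, 64)
        print(f"n = {n:6d}  f = {f:.4f}  S(64) = {s:5.2f}")
```

In this model the sequential fraction shrinks as n grows, so the achievable speedup on 64 processors moves closer to 64 for larger problems.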

Performance Metrics (III)
- Scaleup: ratio of the execution time of a problem of size n on 1 processor to the execution time of the same problem of size n*p on p processors
    \[ S_c(p) = \frac{T_{total}(1, n)}{T_{total}(p, n \cdot p)} \]
  - Optimally, execution time remains constant, e.g.
    \[ T_{total}(p, n) = T_{total}(2p, 2n) \]

Cluster Computing
- Cluster: collection of individual PCs (compute nodes) connected by a (high performance) network interconnect
- Each compute node is an independent entity with its own
  - Processor
  - Main memory
  - One or multiple networking cards
- All compute nodes typically have access to a shared file system (e.g. Network File System (NFS))
  - Removes the necessity to replicate programs and data on all compute nodes
  - All accesses to files require communication over the network
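
Returning to the scaleup metric from Performance Metrics (III): it needs only two measurements, the run time for size n on one processor and the run time for size n*p on p processors. A minimal sketch, with a made-up function name and hypothetical timings:

```python
def scaleup(t_one_n, t_p_np):
    """Sc(p) = T_total(1, n) / T_total(p, n * p).

    t_one_n: time for a problem of size n on 1 processor
    t_p_np:  time for a problem of size n * p on p processors
    """
    if t_one_n <= 0 or t_p_np <= 0:
        raise ValueError("execution times must be positive")
    return t_one_n / t_p_np


# Hypothetical timings: 100 s for size n on 1 processor,
# 125 s for size 8n on 8 processors.
sc8 = scaleup(100.0, 125.0)

if __name__ == "__main__":
    print(f"Sc(8) = {sc8:.2f}")  # 1.00 would mean the run time stayed constant
```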

Conceptual View
[Figure: eight compute nodes, each with a network card and a hard drive, attached to a common network interconnect]

Cluster Components (I)
- Compute nodes mostly based on regular PC technology
  - Intel/AMD processors
  - 1-4GB of main memory per core
- Operating Systems: typically Linux/UNIX
- Management of resources: cluster scheduler manages allocation of compute nodes to users

Cluster Components (II)
- Networking metrics:
  - Latency: minimal time to send a very short message from one communication endpoint to another endpoint
    - Unit: ms, μs
  - Bandwidth: amount of data which can be transferred from one process to another in a certain time frame
    - Unit: Bytes/sec, ..., GB/s; Bits/sec, ..., Gb/s
- Off-the-shelf technology vs. High-End Technology
  - Gigabit-Ethernet vs. InfiniBand, 10GE, Cray Gemini
  - Most clusters contain both, a high-end and a low-end network interconnect
- Network Topology of importance for large clusters
  - If more than one switch is required: how are nodes connected
  - Metric: Bisection bandwidth

Parallel Databases
- A parallel database system seeks to improve performance through parallelization of various operations
  - Data is stored in a distributed fashion
  - Distribution is governed by performance considerations
  - Improves processing and input/output speeds by using multiple CPUs and disks in parallel
- Parallel databases often use multiprocessor architecture
  - Shared memory architecture: multiple processors share the main memory space
  - Shared disk architecture: each node has its own main memory, all nodes share mass storage
  - Shared nothing architecture: each node has its own mass storage as well as main memory
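
As a rough illustration of the shared nothing idea, the sketch below distributes rows across nodes by hashing a key, lets every node aggregate only its own partition, and then merges the partial results. Hash partitioning is one common distribution scheme and is an assumption here; the slides do not say which scheme a given system uses. All function and field names are hypothetical, and the "nodes" are plain Python lists rather than real machines.

```python
import zlib
from collections import defaultdict


def node_for_key(key, num_nodes):
    """Map a key to a node index with a stable hash (CRC32)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def partition_rows(rows, key_field, num_nodes):
    """Distribute rows so that each node stores only its own partition."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[node_for_key(row[key_field], num_nodes)].append(row)
    return partitions


def local_sum(partition, group_field, value_field):
    """Work done independently on one node: sum values per group."""
    totals = defaultdict(int)
    for row in partition:
        totals[row[group_field]] += row[value_field]
    return dict(totals)


def merge_sums(partials):
    """Combine the per-node partial results into the final answer."""
    merged = defaultdict(int)
    for partial in partials:
        for group, value in partial.items():
            merged[group] += value
    return dict(merged)


rows = [
    {"customer": "alice", "amount": 5},
    {"customer": "bob", "amount": 7},
    {"customer": "alice", "amount": 3},
    {"customer": "carol", "amount": 1},
    {"customer": "bob", "amount": 2},
]
partitions = partition_rows(rows, "customer", num_nodes=4)
result = merge_sums(local_sum(part, "customer", "amount") for part in partitions)

if __name__ == "__main__":
    print(result)
```

Because all rows with the same key land on the same node, each node can compute its groups without looking at any other node's storage.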

Advantages of Parallel Database Systems
- System uses an optimizer to translate SQL commands into a query plan whose execution is divided among compute nodes
- High level programming (SQL) does not require any knowledge of underlying hardware
- A lot of data is already stored in database systems
- 20+ years of experience in parallel database systems

Disadvantages of Parallel Database Systems
- Database systems have certain requirements on the data format
  - Difficult to handle irregular, unstructured, incomplete data sets
- Database systems not efficient in adding large data volumes
- Price of large scale parallel database systems

Cloud Computing
- Cloud Computing: general term used to describe a class of network based computing
  - A collection/group of integrated and networked hardware, software and Internet infrastructure (called a platform)
  - Using the Internet for communication and transport provides hardware, software and networking services to clients
- Hides the complexity and details of the underlying infrastructure from users and applications by providing a very simple graphical interface or API

Cloud Computing (II)
- The platform provides on demand services, that are always on, anywhere, anytime and any place
- Pay for use and as needed
- Scale up and down in capacity and functionality
- The hardware and software services are available to general public, enterprises, corporations and businesses markets
- Services or data are hosted on remote infrastructure

Cloud Service Models
- Software as a Service (SaaS): execute a specific application required for business / research
- Platform as a Service (PaaS): deploy customer created applications
- Infrastructure as a Service (IaaS): rent processing and compute capacity, storage, etc.

Cloud Computing Summary
- Positive
  - No need for local IT infrastructure
  - Scalability
  - Reliability not a major concern
  - Implicit software updates
- Negative
  - No performance guarantees: utilization of shared resources
  - Privacy, security, compliance, trust
  - Need to evaluate utilization/costs benefits

Comparison of the platforms

                                     Cluster computing   Parallel Database   Cloud Computing
  Initial hardware investment costs  High                High                Zero
  Initial software investment costs  Low                 High                Zero
  Maintenance costs                  Low                 Low-medium          Zero
  Software development efforts       High                Low                 Medium
  Software flexibility               High                Low                 Medium
  Efficiency                         High                High                Low
  Costs per job                      Low                 Low                 High