Towards Zero-Overhead Static and Adaptive Indexing in Hadoop
Stefan Richter · Jorge-Arnulfo Quiané-Ruiz · Stefan Schuh · Jens Dittrich

Abstract Hadoop MapReduce has evolved to an important industry standard for massive parallel data processing and has become widely adopted for a variety of use cases. Recent works have shown that indexes can improve the performance of selective MapReduce jobs dramatically. However, one major weakness of existing approaches is their high index creation cost. We present HAIL (Hadoop Aggressive Indexing Library), a novel indexing approach for HDFS and Hadoop MapReduce. HAIL creates different clustered indexes over terabytes of data with minimal, often invisible costs and it dramatically improves runtimes of several classes of MapReduce jobs. HAIL features two different indexing pipelines, static indexing and adaptive indexing. HAIL static indexing efficiently indexes datasets while uploading them to HDFS. Thereby, HAIL leverages the default replication of Hadoop and enhances it with logical replication. This allows HAIL to create multiple clustered indexes for a dataset, e.g. one for each physical replica. Still, in terms of upload time, HAIL matches or even improves over the performance of standard HDFS. Additionally, HAIL adaptive indexing allows for automatic, incremental indexing at job runtime with minimal runtime overhead. For example, HAIL adaptive indexing can completely index a dataset as a byproduct of only four MapReduce jobs while incurring an overhead as low as 11% for the very first of those jobs only. In our experiments, we show that HAIL improves job runtimes by up to 68x over Hadoop. This article is an extended version of the VLDB 2012 paper "Only Aggressive Elephants are Fast Elephants" (PVLDB, 5(11), 2012).

S. Richter, S. Schuh, J. Dittrich: Information Systems Group, Saarland University
J.-A. Quiané-Ruiz: Qatar Computing Research Institute, Qatar Foundation

1 Introduction

MapReduce has become the de facto standard for large scale data processing in many enterprises.
It is used for developing novel solutions on massive datasets such as web analytics, relational data analytics, machine learning, data mining, and real-time analytics [23]. In particular, log processing emerges as an important type of data analysis commonly done with MapReduce [5,36,18]. In fact, Facebook and Twitter use Hadoop MapReduce (the most popular MapReduce open source implementation) to analyze the huge amounts of web logs generated every day by their users [43,22,35]. Over the last years, a lot of research works have focused on improving the performance of Hadoop MapReduce [12,26,32,34]. When improving the performance of MapReduce, it is important to consider that it was initially developed for large aggregation tasks that scan through huge amounts of data. However, nowadays Hadoop is often also used for selective queries that aim to find only a few relevant records for further consideration¹. For selective queries, Hadoop still scans through the complete dataset. This resembles the search for a needle in a haystack. For this reason, several researchers have particularly focused on supporting efficient index access in Hadoop [45,15,35,33]. Some of these works have improved the performance of selective MapReduce jobs by orders of magnitude.

¹ A simple example of such a use case would be a distributed grep.

However, all these indexing approaches have three main weaknesses. First, they require a high upfront cost for index creation. This translates to long waiting times for users until they can actually start to run queries. Second, they can only support one physical sort order (and hence one clustered index) per dataset. This becomes a serious problem if the workload demands indexes for several attributes. Third, they require users to have a good knowledge of the workload
in order to choose the indexes to create. This is not always possible, e.g. if the data is analyzed in an exploratory way or queries are submitted by customers.

1.1 Motivation

Let us see through the eyes of a data analyst, say Bob, who wants to analyze a large web log. The web log contains different fields that may serve as filter conditions for Bob like visitDate, adRevenue, sourceIP and so on. Assume Bob is interested in all sourceIPs with a visitDate from 2011. Thus, Bob writes a MapReduce program to filter out exactly those records and discard all others. Bob is using Hadoop, which will scan the entire input dataset from disk to filter out the qualifying records. This takes a while. After inspecting the result set Bob detects a series of strange requests from a particular sourceIP. Therefore, he decides to modify his MapReduce job to show all requests from the entire input dataset having that sourceIP. Bob is using Hadoop. This takes a while. Eventually, Bob decides to modify his MapReduce job again to only return log records having a particular adRevenue. Yes, this again takes a while. In summary, Bob uses a sequence of different filter conditions, each one triggering a new MapReduce job. He is not exactly sure what he is looking for. The whole endeavor feels like going shopping without a shopping list.

This example illustrates an exploratory usage (and a major use-case) of Hadoop MapReduce [5,18,38]. But, this use-case has one major problem: slow query runtimes. The time to execute a MapReduce job based on a scan may be very high: it is dominated by the I/O for reading all input data [39,33]. While waiting for his MapReduce job to complete, Bob has enough time to pick a coffee (or two) and this happens every time Bob modifies the MapReduce job. This will likely kill his productivity and make his boss unhappy.

Now, assume the fortunate case that Bob remembers a sentence from one of his professors saying "full-table-scans are bad; indexes are good"². Thus, he reads all the recent VLDB papers (including [33,12,26,32]) and finds a paper that shows how to create a so-called trojan index [15].
A trojan index is an index that may be used with Hadoop MapReduce and yet does not modify the underlying Hadoop MapReduce and HDFS engines.

Zero-Overhead indexing. Bob finds the trojan index idea interesting and hence decides to create a trojan index on sourceIP before running his MapReduce jobs. However, using trojan indexes raises two other problems:

(1.) Expensive index creation. The time to create the trojan index on sourceIP (or any other attribute) is even much longer than running a scan-based MapReduce job. Thus, if Bob's MapReduce jobs use that index only a few times, the index creation costs will never be amortized. So, why would Bob create such an expensive index in the first place?

(2.) Which attribute to index? Even if Bob amortizes index creation costs, the trojan index on sourceIP will only help for that particular attribute. So, which attribute should Bob use to create the index?

Bob is wondering how to create several indexes at very low cost to solve those problems.

² The professor is aware that for some situations the opposite is true.

Per-Replica indexing. One day in autumn 2011, Bob reads about another idea [34] where some researchers looked at ways to improve vertical partitioning in Hadoop. The researchers in that work realized that HDFS keeps three (or more) physical copies of all data for fault-tolerance. Therefore, they decided to change HDFS to store each physical copy in a different data layout (row, column, PAX, or any other column grouping layout). As all data layout transformation is done per HDFS data block, the failover properties of HDFS and Hadoop MapReduce were not affected. At the same time, I/O times improved. Bob thinks that this looks very promising, because he could possibly exploit this concept to create different clustered indexes almost invisible to the user. This is because he could create one clustered index per data block replica when uploading data to HDFS. This would already help him a lot in several query workloads. However, Bob quickly figures out that there are cases where this idea still has some annoying limitations.
Even if Bob could create one clustered index per data replica at low cost, he would still have to determine which attributes to index when uploading his data to HDFS. Afterwards, he could not easily revise his decision or introduce additional indexes without uploading the dataset again. Unfortunately, it sometimes happens that Bob and his colleagues navigate through datasets according to the properties and correlations of the data. In such cases, Bob and his colleagues typically: (1.) do not know the data access patterns in advance; (2.) have different interests and hence cannot agree upon common selection criteria at data upload time; (3.) even if they agree which attributes to index at data upload time, they might end up filtering records according to values on different attributes. Therefore, using any traditional indexing technique [19,1,2,8,11,45,35,15,33] would be problematic, because they cannot adapt well to unknown or changing query workloads.

Adaptive indexing. When searching for a solution to his problem with static indexing, Bob stumbles across a new approach called adaptive indexing [28], where the general idea is to create indexes as a side-effect of query processing. This is similar to the idea of soft indexes [37], where the system piggybacks the index creation for a given attribute on a single incoming query. However, in contrast to soft indexes, adaptive indexing aims at creating indexes incrementally (i.e., piggybacking on several incoming queries) in order to avoid high upfront index creation times. Thus, Bob is excited about the adaptive indexing idea since this could be the missing piece to solve his remaining concern. However, Bob quickly notices that he cannot simply apply existing
adaptive indexing works [17,28,29,21,3,24] in MapReduce systems for several reasons:

(1.) Global index convergence. These techniques aim at converging to a global index for an entire attribute, which requires sorting the attribute globally. Therefore, these techniques perform many data movements across the entire dataset. Doing this in MapReduce would hurt fault-tolerance as well as the performance of MapReduce jobs. This is because the system would have to move data across data blocks in sync with all their three physical data block replicas. We do not plan to create global indexes, but focus on creating partial indexes that in total cover the whole dataset. A small back of the envelope calculation shows that the possible gains of a global index are negligible in comparison to the overhead of the MapReduce framework. For instance, if a dataset is uniformly distributed over a cluster and occupies 16 HDFS blocks on each datanode (like the dataset in our experiments in Section 9) and we do not have a global index, then we need to perform 16 index accesses on each datanode. Since all datanodes can access their blocks in parallel to each other, we assume that the overhead is determined by the highest overhead per datanode. Overall, our approach requires at most 318 additional random reads in HDFS per datanode in this scenario, which in turn cost roughly 15ms each. In total, this amounts to 4.77s overhead compared to a global index stored in HDFS. However, even empty MapReduce jobs, that do not read any data nor compute a single map function, run for more than 10s.

(2.) High I/O costs. Even if Bob applied existing adaptive indexing techniques inside data blocks, these techniques would end up in many costly I/O operations to move data on disk. This is because these techniques consider main-memory systems and thus do not factor in the I/O-cost for reading/writing data from/to disk. Only one of these works [21] proposes an adaptive merging technique for disk-based systems. However, applying this technique inside an HDFS block would not make sense in MapReduce since HDFS blocks are typically loaded entirely into main memory anyways when processing map tasks.
One may think about applying adaptive merging across HDFS blocks, but this would again hurt fault-tolerance and the performance of MapReduce jobs as described above.

(3.) Unclustered index. These works focus on creating unclustered indexes in the first place and hence it is only beneficial for highly selective queries. One of these works [29] introduced lazy tuple reorganisation in order to converge to clustered indexes. However, this technique needs several thousand queries to converge and its application in a disk-based system would again introduce a huge number of expensive I/O operations.

(4.) Centralized approach. Existing adaptive indexing approaches were mainly designed for single-node DBMSs. Therefore, applying these works in distributed parallel systems, like Hadoop MapReduce, would not fully exploit the existing parallelism to distribute the indexing effort across several computing nodes.

Despite all these open problems, Bob is very enthusiastic to combine the above interesting ideas on indexing into a new system to revolutionize the way his company can use Hadoop. And this is where the story begins.

1.2 Research Questions and Challenges

This article addresses the following research questions:

Zero-Overhead indexing. Current indexing approaches in Hadoop involve a significant upfront cost for index creation. How can we make indexing in Hadoop so effective that it is basically invisible for the user? How can we minimize the I/O costs for indexing or eventually reduce them to zero? How can we fully utilize the available CPU resources and parallelism of large clusters for indexing?

Per-Replica indexing. Hadoop uses data replication for failover. How can we exploit this replication to support different sort orders and indexes? Which changes to the HDFS upload pipeline need to be done to make this efficient? What happens to the involved checksum mechanism of HDFS? How can we teach the HDFS namenode to distinguish the different replicas and keep track of the different indexes?

Job execution. How can we change Hadoop MapReduce to utilize different sort orders and indexes at query time?
How can we change Hadoop MapReduce to schedule tasks to replicas having the appropriate index? How can we schedule map tasks to efficiently process indexed and non-indexed data blocks without affecting failover? How much do we need to change existing MapReduce jobs? How will Hadoop MapReduce change from the user's perspective?

Zero-Overhead Adaptive indexing. How can we adaptively and automatically create additional useful indexes online at minimal costs per job? How to index big data incrementally in a distributed, disk-based system like Hadoop as a byproduct of job execution? How to minimize the impact of indexing on individual job execution times? How to efficiently interleave data processing with indexing? How to distribute the indexing effort efficiently by considering data-locality and index placement across computing nodes? How to create several clustered indexes at query time? How to support a different number of replicas per data block?

1.3 Contributions

We propose HAIL (Hadoop Aggressive Indexing Library), a static and adaptive indexing approach for MapReduce systems. The main goal of HAIL is to minimize both (i) the index creation time when uploading data and (ii) the impact of concurrent index creation on job execution times. In summary, we make the following main contributions to tackle the questions and challenges mentioned above:
(1.) Zero-Overhead indexing. We show how to effectively piggy-back sorting and index creation on the existing HDFS upload pipeline. This way no additional MapReduce job is required to create those indexes and also no additional read of the data is required at all. In fact, the HAIL upload pipeline is so effective when compared to HDFS that the additional overhead for sorting and index creation is hardly noticeable in the overall process. Therefore, we offer a win-win situation over Hadoop MapReduce and even over Hadoop++ [15]. We give an overview of HAIL and its benefits in Section 2.

(2.) Per-Replica indexing. We show how to exploit the default replication of Hadoop to support different sort orders and indexes for each block replica (Section 3). Hence, for a default replication factor of three, up to three different sort orders and clustered indexes are available for processing MapReduce jobs. Thus, the likelihood to find a suitable index increases and hence the runtime for a workload improves. Our approach benefits from the fact that Hadoop is only used for appends: there are no updates. Thus, once a block is full, it will never be changed again.

(3.) Job Execution. We show how to effectively change the Hadoop MapReduce pipeline to exploit existing indexes (Section 4). Our goal is to do this without changing the code of the MapReduce framework. Therefore, we introduce optional annotations for MapReduce jobs that allow users to enrich their queries with explicit specifications of their selections and projections. HAIL takes care of performing MapReduce jobs using normal data block replicas or pseudo data block replicas (or even both). In addition, we propose a new task scheduling, called HAIL Scheduling, to fully exploit statically and adaptively indexed data blocks (Section 7). The goal of HAIL Scheduling is twofold: (i) to reduce the scheduling overhead when executing a MapReduce job, and (ii) to balance the indexing effort across computing nodes to limit the impact of adaptive indexing.

(4.) Zero-Overhead Adaptive indexing. We show how to effectively piggyback adaptive index creation on the existing MapReduce job execution pipeline (Section 5).
The idea is to combine adaptive indexing and zero-overhead indexing to solve the problem of missing indexes for evolving or unpredictable workloads. In other words, when HAIL executes a MapReduce job with a filter condition on an unindexed attribute, HAIL creates that missing index for a certain fraction of the HDFS blocks in parallel. We additionally propose a set of adaptive indexing strategies that makes HAIL aware of the performance and the selectivity of MapReduce jobs (Section 6). We present lazy and eager adaptive indexing, two techniques that allow HAIL to quickly adapt to changes in users' workloads at a low indexing overhead. We then show how HAIL can decide which data blocks to index based on the selectivities of MapReduce jobs.

(5.) Exhaustive validation. We present an extensive experimental comparison of HAIL with Hadoop and Hadoop++ [15] (Section 9). We use seven different clusters including physical and virtual EC2 clusters of up to 100 nodes. A series of experiments shows the superiority of HAIL over both Hadoop and Hadoop++. Another series of scalability experiments with different datasets also demonstrates the superiority of using adaptive indexing in HAIL. In particular, our experimental results demonstrate that HAIL: (i) creates clustered indexes at upload time almost for free; (ii) quickly adapts to query workloads with a negligible indexing overhead; and (iii) only for the very first job HAIL has a small overhead over Hadoop when creating indexes adaptively: all the following jobs are faster in HAIL.
Notice that this article presents an extended version of the initial HAIL system [16] with the following significant added value: we enrich HAIL with the adaptive indexing pipeline, which allows HAIL to adapt to changes in query workloads in an automatic, incremental, and dynamic way (all of contribution "Zero-Overhead Adaptive indexing"); we extend the HAIL task scheduling in order to balance the index effort at job execution time and exploit pseudo data blocks (half of contribution "Job execution"); we run a large number of new experiments to validate our adaptive indexing techniques as well as the extended HAIL task scheduling (one third of contribution "Exhaustive validation").

2 Overview

In the following, we give an overview of HAIL by contrasting it with normal HDFS and Hadoop MapReduce. Thereby, we introduce the two indexing pipelines of HAIL. First, static indexing allows us to create several clustered indexes at upload time. Second, HAIL adaptive indexing creates additional indexes as a byproduct of actual job execution, which enables HAIL to adapt to unexpected workloads. For a more detailed contrast to related work see Section 8. For now, let's consider again our motivating example: How can Bob analyze his log file with Hadoop and HAIL?

2.1 Hadoop and HDFS

In HDFS and Hadoop MapReduce, Bob starts by uploading his log file to HDFS using the HDFS client. HDFS then partitions the file into logical HDFS blocks using a constant block size (the HDFS default is 64MB). Each HDFS block is then physically stored three times (assuming the default replication factor). Each physical copy of a block is called a replica. Each replica will sit on a different datanode. Therefore, at least two datanode failures may be survived by HDFS. Note that HDFS keeps information on the different replicas for an HDFS block in a central namenode directory. After uploading his log file to HDFS, Bob may run an actual MapReduce job. Bob invokes Hadoop MapReduce through a Hadoop MapReduce JobClient, which sends his
MapReduce job to a central node termed JobTracker. The MapReduce job consists of several tasks. A task is executed on a subset of the input file, typically an HDFS block³. The JobTracker assigns each task to a different TaskTracker, which typically runs on the same machine as an HDFS datanode. Each datanode will then read its subset of the input file, i.e., a set of HDFS blocks, and feed that data into the MapReduce processing pipeline which usually consists of a Map, Shuffle, and a Reduce Phase (see [13,15,14] for a detailed description). As soon as all results have been written to HDFS, the JobClient informs Bob that the results are available. Notice that the execution time of the MapReduce job is heavily influenced by the size of the input dataset, because Hadoop MapReduce reads the input dataset entirely in order to perform any incoming MapReduce job.

2.2 HAIL

In HAIL, Bob analyzes his log file as follows. He starts by uploading his log file to HAIL using the HAIL client. In contrast to the HDFS client, the HAIL client analyzes the input data for each HDFS block, converts each HDFS block directly to a binary columnar layout that resembles PAX [3], and sends it to three datanodes. Then, all datanodes sort the data contained in that HDFS block in parallel using a different sort order. The required sort orders can be manually specified by Bob in a configuration file or computed by a physical design algorithm. For each HDFS block, all sorting and index creation happens in main memory. This is feasible as the HDFS block size is typically between 64MB (default) and 1GB. This easily fits in the main memory of most machines. In addition, in HAIL, each datanode creates a different clustered index for each HDFS block replica and stores it with the sorted data. This process is called the HAIL static indexing pipeline.

After uploading his log file to HAIL, Bob runs his MapReduce jobs, which can now immediately exploit the indexes that were created by HAIL statically (i.e., at upload time). As before, Bob invokes Hadoop MapReduce through a JobClient which sends his MapReduce jobs to the JobTracker.
However, his MapReduce jobs are slightly modified so that the system can decide to eventually use available indexes on the data block replicas. For example, assume that a data block has three replicas with clustered indexes on visitDate, adRevenue, and sourceIP. In case that Bob has a MapReduce job filtering on visitDate, HAIL uses the replicas having the clustered index on visitDate. If Bob is filtering on sourceIP, HAIL uses the replicas having the clustered index on sourceIP and so on. To provide failover and load balancing, HAIL may fall back to standard Hadoop scanning for some of the blocks. However, even factoring this in, Bob's queries run much faster on average, if indexes on the right attributes exist.

In case that Bob submits jobs that filter on unindexed attributes (e.g., on duration), HAIL again falls back to a standard full scan by choosing any arbitrary replica, just like Hadoop. However, in contrast to Hadoop, HAIL can index HDFS blocks in parallel to job execution. If another job filters again on the duration field, the new job can already benefit from the previously indexed blocks. So, HAIL takes incoming jobs, which have a selection predicate on currently unindexed attributes, as hints for valuable additional clustered indexes. Consequently, the set of available indexes in HAIL evolves with changing workloads. We call this process the HAIL adaptive indexing pipeline.

³ Actually it is a split. The difference does not matter here. We will get back to this in Section 4.2.

2.3 HAIL Benefits

(1.) HAIL often improves both upload and query times. The upload is dramatically faster than Hadoop++ and often faster (or only slightly slower) than with the standard Hadoop even though we (i) convert the input file into binary PAX, (ii) create a series of different sort orders, and (iii) create multiple clustered indexes. From the user-side, this provides a win-win situation: there is no noticeable punishment for upload. For querying, users can only win: if our indexes cannot help, we will fall back to standard Hadoop scanning; if the indexes can help, query runtimes will improve. Why do we not have high costs at upload time? We basically exploit the unused CPU ticks that are not used by standard HDFS.
As the standard HDFS upload pipeline is I/O-bound, the effort for our sorting and index creation in the HAIL upload pipeline is hardly noticeable. In addition, since we parse data to binary while uploading, we often benefit from smaller datasets triggering less network and disk I/O.

(2.) Even if we did not create the right indexes at upload time, HAIL can create indexes adaptively at job execution time without incurring high overhead. Why don't we see a high overhead? We do not need to additionally load the block data to main memory, since we piggyback on the reading of the map tasks. Furthermore, HAIL creates indexes incrementally over several job executions using different adaptive indexing strategies.

(3.) We do not change the failover properties of Hadoop. Why is failover not affected? All data stays on the same logical HDFS block. We just change the physical representation of each replica of an HDFS block. Therefore, from each physical replica we may recover the logical HDFS block.

(4.) HAIL works with existing MapReduce jobs incurring only minimal changes to those jobs. Why does this work? We allow Bob to annotate his existing jobs with selections and projections. Those annotations are then considered by HAIL to pick the right index. Like that, for Bob the changes to his MapReduce jobs are minimal.
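The replica choice described in Section 2.2 can be illustrated with a small sketch. It is not HAIL's scheduler: the function name `pick_replica`, the dictionary that maps datanodes to indexed attributes, and the datanode names are assumptions made for this example. It only shows the rule from the text: use a replica whose clustered index matches the filter attribute, otherwise fall back to a full scan on an arbitrary replica.

```python
from typing import Dict, Optional, Tuple


def pick_replica(replica_indexes: Dict[str, Optional[str]],
                 filter_attribute: str) -> Tuple[str, bool]:
    """Return (datanode, use_index) for one HDFS block.

    replica_indexes maps a datanode name to the attribute its replica is
    sorted and indexed on (None for a replica without an index).
    """
    for datanode, indexed_attribute in replica_indexes.items():
        if indexed_attribute == filter_attribute:
            return datanode, True  # index scan on the matching replica
    # No replica has a suitable index: any replica will do, with a full scan.
    return next(iter(replica_indexes)), False


# Hypothetical block with three replicas, one clustered index each.
replicas = {"DN1": "visitDate", "DN2": "adRevenue", "DN3": "sourceIP"}
print(pick_replica(replicas, "sourceIP"))  # matching index on DN3
print(pick_replica(replicas, "duration"))  # no index, falls back to a scan
```

A real scheduler would also weigh data locality and load, which this sketch leaves out.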
Fig. 1 The HAIL static indexing pipeline as part of uploading data to HDFS

3 HAIL Zero-Overhead Static Indexing

We create static indexes in HAIL while uploading data. One of the main challenges is to support different sort orders and clustered indexes per replica as well as to build those indexes efficiently without much impact on upload times. Figure 1 shows the data flow when Bob uploads a file to HAIL. Let's first explore the details of the static indexing pipeline.

3.1 Data Layout

In HDFS, for each block, the client contacts the namenode to obtain the list of datanodes that should store the block replicas. Then, the client sends the original block to the first datanode, which forwards this to the second datanode and so on. In the end, each datanode stores a byte-identical copy of the original block data.

In HAIL, the HAIL client preprocesses the file based on its content to consider end of lines (step 1 in Figure 1). We parse the contents into rows by searching for end of line symbols and never split a row between two blocks. This is in contrast to standard HDFS which splits a file into HDFS blocks after a constant number of bytes. For each block the HAIL client parses each row according to the schema specified by the user⁴. If HAIL encounters a row that does not match the given schema (i.e., a bad record), it separates this record into a special part of the data block. HAIL then converts all HDFS blocks to a binary columnar layout that resembles PAX (step 2). This allows us to index and access individual attributes more efficiently. The HAIL client also collects metadata information from each HDFS block (such as the data schema) and creates a block header (Block Metadata) for each HDFS block (step 2).
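The client-side preprocessing above can be sketched as follows. This is a simplified illustration under stated assumptions, not the HAIL client: the function names, the `|` field separator, and the use of plain Python lists instead of a binary PAX layout are all choices made for this example. It shows the two ideas from the text: blocks end on row boundaries, and rows that do not match the schema are kept apart as bad records.

```python
from typing import Callable, Dict, List, Tuple

Schema = List[Tuple[str, Callable[[str], object]]]


def split_on_row_boundaries(data: bytes, block_size: int) -> List[bytes]:
    """Cut data into blocks of at most block_size bytes without splitting a row.

    A single row longer than block_size still gets a block of its own.
    """
    blocks: List[bytes] = []
    current: List[bytes] = []
    current_size = 0
    for row in data.splitlines(keepends=True):
        if current and current_size + len(row) > block_size:
            blocks.append(b"".join(current))
            current, current_size = [], 0
        current.append(row)
        current_size += len(row)
    if current:
        blocks.append(b"".join(current))
    return blocks


def to_columnar(block: bytes, schema: Schema,
                separator: bytes = b"|") -> Tuple[Dict[str, list], List[bytes]]:
    """Parse rows by schema into one list per attribute; collect bad records."""
    columns: Dict[str, list] = {name: [] for name, _ in schema}
    bad_records: List[bytes] = []
    for row in block.splitlines():
        fields = row.split(separator)
        try:
            if len(fields) != len(schema):
                raise ValueError("wrong number of fields")
            values = [parse(field.decode("utf-8"))
                      for (_, parse), field in zip(schema, fields)]
        except ValueError:
            bad_records.append(row)  # kept in a separate part of the block
            continue
        for (name, _), value in zip(schema, values):
            columns[name].append(value)
    return columns, bad_records


# Hypothetical three-row web log; the second row has a malformed adRevenue.
log = (b"10.0.0.1|2011-03-01|1.5\n"
       b"10.0.0.2|2011-04-02|oops\n"
       b"10.0.0.3|2012-01-09|0.7\n")
schema = [("sourceIP", str), ("visitDate", str), ("adRevenue", float)]
blocks = split_on_row_boundaries(log, block_size=50)
columns, bad_records = to_columnar(blocks[0], schema)
print(len(blocks), columns, bad_records)
```

Storing each attribute contiguously is what makes it cheap to sort and index a block on a single attribute later in the pipeline.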
We could naively piggy-back on this existing HDFS upload pipeline by first storing the original block data as done in Hadoop and then converting it to binary PAX layout in a second step. However, we would have to re-read and then re-write each block, which would trigger one extra write and read for each replica, e.g., for an input file of 1GB we would have to pay 6GB extra I/O on the cluster. This would lead to very long upload times. In contrast, HAIL does not have to pay any of that extra I/O. However, to achieve this dramatic improvement, we have to make non-trivial changes in the standard Hadoop upload pipeline.

⁴ Alternatively, HAIL can also suggest an appropriate schema to users through schema analysis.

3.2 Static Indexing in the Upload Pipeline

To understand the implementation of static indexing in the HAIL upload pipeline, we first have to analyze the normal HDFS upload pipeline in more detail. In HDFS, while uploading a block, the data is further partitioned into chunks of constant size 512B. Chunks are collected into packets. A packet is a sequence of chunks plus a checksum for each of the chunks. In addition some metadata is kept. In total a packet has a size of up to 64KB. Immediately before sending the data over the network, each HDFS block is converted to a sequence of packets. On disk, HDFS keeps, for each replica, a separate file containing checksums for all of its chunks. Hence, for each replica two files are created on local disk: one file with the actual data and one file with its checksums. These checksums are reused by HDFS whenever data is sent over the network.

The HDFS client (CL) sends the first packet of the block to the first datanode (DN1) in the upload pipeline. DN1 splits the packet into two parts: the first contains the actual chunk data, the second contains the checksums for those chunks. Then DN1 flushes the chunk data to a file on local disk. The checksums are flushed to an extra file. In parallel DN1 forwards the packet to DN2 which splits and flushes the data like DN1 and in turn forwards the packet to DN3 which splits and flushes the data as well. Yet, only DN3 verifies the checksum for each chunk. If the recomputed checksums for each chunk of
a packet match the received checksums, DN3 acknowledges the packet back to DN2, which acknowledges back to DN1. Finally, DN1 acknowledges back to CL. Each datanode also appends its ID to the ACK. Like that only one of the datanodes (the last in the chain, here DN3 as the replication factor is three) has to verify the checksums. DN2 believes DN3, DN1 believes DN2, and CL believes DN1. If any CL or DNi receives ACKs in the wrong order, the upload is considered failed. The idea of sending multiple packets from CL is to hide the roundtrip latencies of the individual packets. Creating this chain of ACKs also has the benefit that CL only receives a single ACK for each packet and not three. Notice that HDFS provides this checksum mechanism on top of the existing TCP/IP checksum mechanism (which has weaker correctness guarantees than HDFS).

In HAIL, in order to reuse as much of the existing HDFS pipeline and yet to make this efficient, we need to perform the following changes. As before, the HAIL client (CL) gets the list of datanodes to use for this block from the HDFS namenode (step 3). But rather than sending the original input, CL creates the PAX block, cuts it into packets (step 4), and sends it to DN1 (step 5). Whenever a datanode DN1 to DN3 receives a packet, it does neither flush its data nor its checksums to disk. Still, DN1 and DN2 immediately forward the packet to the next datanode as before (step 8). DN3 will verify the checksum of the chunks for the received PAX block (step 9) and acknowledge the packet back to DN2 (step 10). This means the semantics of an ACK for a packet of a block are changed from "packet received, validated, and flushed" to "packet received and validated". We do neither flush the chunks nor its checksums to disk as we first have to sort the entire block according to the desired sort key. On each datanode, we assemble the block from all packets in main memory (step 6). This is realistic in practice, since main memories tend to be >1GB for any modern server. Typically, the size of a block is between 64MB (default) and 1GB. This means that for the default size we could keep about 15 blocks in main memory at the same time.
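The packet handling described above can be sketched in a few lines. This is an illustration only, not HDFS or HAIL code: CRC32 via `zlib.crc32` stands in for the checksum, packet metadata is ignored, and the function names are made up for this example. The chunk size (512B) and the packet bound (64KB) come from the text. The sketch shows the changed ACK semantics: the last datanode validates each packet and buffers it in main memory instead of flushing it to disk.

```python
import zlib
from typing import List, Tuple

CHUNK_SIZE = 512          # bytes of data per chunk (from the text)
PACKET_SIZE = 64 * 1024   # upper bound for a packet (from the text)
CHECKSUM_SIZE = 4         # assumption: one 4-byte CRC32 value per chunk

Packet = List[Tuple[bytes, int]]  # (chunk data, checksum) pairs


def make_packets(block: bytes) -> List[Packet]:
    """Cut a block into chunks, attach a checksum to each, group into packets."""
    chunks_per_packet = PACKET_SIZE // (CHUNK_SIZE + CHECKSUM_SIZE)
    chunks = [block[i:i + CHUNK_SIZE] for i in range(0, len(block), CHUNK_SIZE)]
    return [[(chunk, zlib.crc32(chunk)) for chunk in chunks[i:i + chunks_per_packet]]
            for i in range(0, len(chunks), chunks_per_packet)]


def verify(packet: Packet) -> bool:
    """Recompute every chunk checksum and compare it with the received one."""
    return all(zlib.crc32(chunk) == checksum for chunk, checksum in packet)


def receive_on_last_datanode(packets: List[Packet]) -> bytes:
    """Validate each packet and reassemble the block in memory (no disk flush)."""
    buffered = []
    for number, packet in enumerate(packets):
        if not verify(packet):
            raise ValueError(f"checksum mismatch in packet {number}")
        buffered.append(b"".join(chunk for chunk, _ in packet))
        # An ACK meaning "packet received and validated" would be sent here.
    return b"".join(buffered)


block = bytes(range(256)) * 1000  # 256,000 bytes of sample data
packets = make_packets(block)
print(len(packets), receive_on_last_datanode(packets) == block)
```

Only once the whole block is assembled can a datanode sort it, which is why flushing is deferred in this design.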
In parallel to forwarding and reassembling packets, each datanode sorts the data, creates indexes, and forms a HAIL Block (step 7) (see Section 3.4). As part of this process, each datanode also adds Index Metadata information to each HAIL block in order to specify the index it created for this block. Each datanode (e.g., DN1) typically sorts the data inside a block in a different sort order. It is worth noting that having different sort orders across replicas does not impact fault-tolerance as all data is reorganized inside the same block only, i.e., data is not reorganized across blocks. Hence, all replicas of the same HDFS block logically contain the same records with just a different order and therefore can still act as logical replacements for each other. Additionally, this property helps HAIL to preserve the load balancing capabilities of Hadoop. For example, when a datanode containing the replica with matching sort order for a certain job is overloaded, HAIL might choose to read from a different replica on another datanode, just like normal Hadoop. To avoid overloading datanodes in the first place, HAIL employs a round robin strategy for assigning sort orders to physical replicas on top of the replica placement of HDFS. This means that while HDFS already cares about distributing HDFS block replicas across the cluster, HAIL cares about distributing the sort orders (and hence the indexes) across those replicas.

As soon as a datanode has completed sorting and creating its index, it will recompute checksums for each chunk of a block. Notice that checksums will differ on each replica, as different sort orders and indexes are used. Hence, each datanode has to compute its own checksums. Then, each datanode flushes the chunks and newly computed checksums to two separate files on local disk as before. For DN3, once all chunks and checksums have been flushed to disk, DN3 will acknowledge the last packet of the block back to DN2 (step 10). After that DN3 will inform the HDFS namenode about its new replica including its HAIL block size, the created indexes, and the sort order (step 11) (see Section 3.3). Datanodes DN2 and DN1 append their ID to each ACK (step 12). Then they forward each ACK back in the chain (step 13).
DN2 and DN1 will forward the last ACK of the block only if all chunks and checksums have been flushed to their disks. After that, DN2 and DN1 individually inform the HDFS namenode ⑭. The HAIL client also verifies that all ACKs arrive in order ⑮. Notice that it is important to change the HDFS namenode in order to keep track of the different sort orders. We discuss these changes in Section 3.3.

3.3 HDFS Namenode Extensions

In HDFS, the central namenode keeps a directory Dir_block of blocks, i.e., a mapping blockid → Set Of DataNodes. This directory is required by any operation retrieving blocks from HDFS. Hadoop MapReduce exploits Dir_block for scheduling. In Hadoop MapReduce, whenever a split needs to be assigned to a worker in the map phase, the scheduler looks up Dir_block in the HDFS namenode to retrieve the list of datanodes having a replica of the contained HDFS block. Then, the Hadoop MapReduce scheduler will try to schedule map tasks on those datanodes if possible. Unfortunately, the HDFS namenode does not differentiate the replicas w.r.t. their physical layouts. HDFS was simply not designed for this. Thus, from the point of view of the namenode all replicas are byte-equivalent and have the same size.

In HAIL, we need to allow Hadoop MapReduce to change the scheduling process to schedule map tasks close to replicas having a suitable index; otherwise Hadoop MapReduce would pick indexes randomly. Hence, we have to enrich the HDFS namenode to keep additional information about the available indexes. We do this by keeping an additional directory Dir_rep mapping (blockid, datanode) → HAILBlockReplicaInfo. An instance of HAILBlockReplicaInfo contains detailed information about the types of available indexes for a replica, i.e., indexing key, index type, size, start offsets, etc. As before, Hadoop MapReduce looks up Dir_block to retrieve the list of datanodes having a replica for a given block. However, in addition, HAIL looks up the main memory Dir_rep to obtain the detailed HAILBlockReplicaInfo for each replica, i.e., one main memory lookup for each replica. HAILBlockReplicaInfo is then exploited by HAIL to change the scheduling strategy of Hadoop (we will discuss this in detail in Section 4).

3.4 An Index Structure for Zero-Overhead Indexing

In this section, we briefly discuss our choice of an appropriate index structure for indexing at minimal costs in HAIL and give some details on our concrete implementation.

Why Clustered Indexes? An interesting question is why we focus on clustered indexes. For indexing with minimal overhead, we require an index structure that is cheap to create in main memory, cheap to write to disk, and cheap to query from disk. We tried a number of indexes in the beginning of the project, including coarse-granular indexes and unclustered indexes. After some experimentation we quickly discovered that sorting and index creation in main memory is so fast that techniques like partial or coarse-granular sorting do not pay off for HAIL. Whether you pay three or two seconds for sorting and indexing per block during upload is hardly noticeable in the overall upload process of HDFS. In addition, a major problem with unclustered indexes is that they are only competitive for very selective queries, as they may trigger considerable random I/O for non-selective index traversals. In contrast, clustered indexes do not have that problem. Whatever the selectivity, we will read the clustered index and scan the qualifying blocks. Hence, even for very low selectivities the only overhead over a scan is the initial index node traversal, which is negligible. Moreover, as unclustered indexes are dense by definition, they require considerably more additional space on disk and require more write I/O than a sparse clustered index.
Thus, using unclustered indexes would severely affect upload times. Yet, an interesting direction for future work would be to extend HAIL to support additional indexes that might boost performance, such as bitmap indexes and inverted lists.

4 HAIL Job Execution

We now focus on general job execution in HAIL. First, we present from Bob's perspective how he can enhance MapReduce jobs to benefit from HAIL static indexing (Section 4.1). We will explain how Bob can write his MapReduce jobs (almost) as before and run them exactly as when using Hadoop MapReduce. After that, we analyze the standard Hadoop MapReduce pipeline from the system's perspective and then compare how HAIL executes jobs (Section 4.2). We will see that HAIL requires only small changes in the Hadoop MapReduce framework, which makes HAIL easy to integrate into newer Hadoop versions (Section 4.3). Figure 2 shows the query pipeline when Bob runs a MapReduce job on HAIL. Finally, we briefly discuss the case of selections on unindexed attributes, i.e., when a job requests a static index that was not created, as motivation for HAIL adaptive indexing (Section 4.4).

4.1 Bob's Perspective

In Hadoop MapReduce, Bob writes a MapReduce job, which includes a job configuration class, a map function, and a reduce function. In HAIL, the MapReduce job remains the same (see ① and ② in Figure 2), but with three tiny changes:

(1) Bob specifies the HailInputFormat (which uses a HailRecordReader internally) in the main class of the MapReduce job. By doing this, Bob enables his MapReduce job to read HAIL Blocks (see Section 3.2).

(2) Bob annotates his map function to specify the selection predicate and the projected attributes required by his MapReduce job⁵. For example, assume that Bob wants to write a MapReduce job that performs the following SQL query (example from the Introduction):

    SELECT sourceIP FROM UserVisits WHERE visitDate BETWEEN '…' AND '2000-01-01'

To execute this query in HAIL, Bob adds a HailQuery annotation to his map function as follows:

    @HailQuery(filter="@3 between(…, 2000-01-01)", projection={@1})
    void map(Text key, Text v) { ... }

Here, the @3 in the filter value and the @1 in the projection value denote the attribute position in the UserVisits records.
In this example the third attribute is visitDate and the first attribute is sourceIP. By annotating his map function as mentioned above, Bob indicates that he wants to receive in the map function only the projected attribute values of those tuples qualifying the specified selection predicate. In case Bob does not specify filter predicates, HAIL will perform a full scan like standard Hadoop. At query time, if the HailQuery annotation is set, HAIL checks (using the Index Metadata of a data block) whether an index exists on the filter attribute. Using such an index allows us to speed up the job execution. HAIL also uses the Block Metadata to determine the schema of a data block. This allows HAIL to read only the attributes specified in the filter and projection parameters.

(3) Bob uses a HailRecord object as input value in the map function. This allows Bob to directly read the projected attributes without splitting the record into attributes as he would do in standard Hadoop MapReduce. For example, using standard Hadoop MapReduce Bob would write the following map function to perform the above SQL query:

Map Function for Hadoop MapReduce (pseudo-code):

    void map(Text key, Text v) {
        String[] attr = v.toString().split(",");
        if (DateUtils.isBetween(attr[2], "…", "2000-01-01"))
            output(attr[0], null);
    }

Using HAIL, Bob writes the following map function:

Map Function for HAIL:

    void map(Text key, HailRecord v) {
        output(v.getInt(1), null);
    }

Notice that Bob now does not have to filter out the incoming records, because this is automatically handled by HAIL via the HailQuery annotation (as mentioned earlier). This annotation is illustrated in Figure 2.

⁵ Alternatively, HAIL allows Bob to specify the selection predicate and the projected attributes in the job configuration class.

Fig. 2 The HAIL query pipeline (Bob's perspective and the system's perspective).

4.2 System Perspective

In Hadoop MapReduce, when Bob submits a MapReduce job, a JobClient instance is created. The main goal of the JobClient is to copy all the resources needed to run the MapReduce job (e.g. metadata and job class files). In addition, the JobClient fetches all the block metadata (BlockLocation[]) of the input dataset. Then, the JobClient logically breaks the input into smaller pieces called input splits (split phase in Figure 2) as defined in the InputFormat. By default, the JobClient computes input splits such that each input split maps to a distinct HDFS block.
An input split defines the input of a map task, while an HDFS block is a horizontal partition of a dataset stored in HDFS (see Section 3.1 for details on how HDFS stores datasets). For scheduling purposes, the JobClient retrieves for each input split all datanode locations having a replica of that HDFS block. This is done by calling getHosts() of each BlockLocation. For instance, in Figure 2, datanodes DN3, DN5, and DN7 are the split locations for split_42 since block_42 is stored on such datanodes.

After this split phase, the JobClient submits the job to the JobTracker with a set of input splits to process ③. Among other operations, the JobTracker creates a map task for each input split. Then, for each map task, the JobTracker decides on which computing node to schedule the map task, using the split locations ④. This decision is based on data-locality and availability [13]. After this, the JobTracker allocates the map task to the TaskTracker (which performs map and reduce tasks) running on that computing node ⑤.

Only then can the map task start processing its input split. The map task uses a RecordReader UDF in order to read its input data block_i from the closest datanode ⑥. Interestingly, it is the local HDFS client running on the node where the map task is running that decides from which datanode a map task will read its input, and not the Hadoop MapReduce scheduler. This is done when the RecordReader asks for the input stream pointing to block_i. It is worth noticing that the HDFS client chooses a datanode from the set of all datanodes storing a replica of block_42 (via the getHosts() method) rather than from the locations given by the input split. This means that a map task might eventually end up reading its input data from a remote node even though it is available locally. Once the input stream is opened, the RecordReader breaks block_42 into records and makes a call to the map function for each record. Assuming that the MapReduce job consists of a map phase only, the map task then writes its output back to HDFS ⑦. See [15,44,14] for more details on the MapReduce execution pipeline.
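The split phase and the locality-based scheduling decision described above can be sketched in a few lines. This is a simplified model, not Hadoop code: the class `SplitSchedulingSketch`, its `Split` type, and the methods `computeSplits` and `chooseNode` are hypothetical names introduced only for illustration, and the fallback to an arbitrary free node is an assumption about what "availability" means here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model of the split phase and locality-aware scheduling.
// All names are hypothetical; none of them are Hadoop classes.
class SplitSchedulingSketch {

    /** An input split that covers exactly one HDFS block. */
    static final class Split {
        final int blockId;
        final List<String> hosts; // datanodes holding a replica of the block

        Split(int blockId, List<String> hosts) {
            this.blockId = blockId;
            this.hosts = hosts;
        }
    }

    /** Split phase: one split per block, carrying the replica locations. */
    static List<Split> computeSplits(Map<Integer, List<String>> blockLocations) {
        List<Split> splits = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : blockLocations.entrySet()) {
            splits.add(new Split(e.getKey(), e.getValue()));
        }
        return splits;
    }

    /** Scheduling: prefer a free node that stores a replica, else any free node. */
    static String chooseNode(Split split, Set<String> freeNodes) {
        for (String host : split.hosts) {
            if (freeNodes.contains(host)) {
                return host; // data-local assignment
            }
        }
        // No local slot is free: fall back to a remote node (null if none is free).
        return freeNodes.isEmpty() ? null : freeNodes.iterator().next();
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> locations = new LinkedHashMap<>();
        locations.put(42, Arrays.asList("DN3", "DN5", "DN7"));
        Split split42 = computeSplits(locations).get(0);

        Set<String> free = new LinkedHashSet<>(Arrays.asList("DN1", "DN5"));
        System.out.println(chooseNode(split42, free));
    }
}
```

In this model the scheduler only sees the hosts carried by the split, which mirrors the point made in the text that the later read decision is taken separately by the HDFS client.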
In HAIL, it is crucial to be non-intrusive to the standard Hadoop execution pipeline so that users run MapReduce jobs exactly as before. However, supporting per-replica indexes in an efficient way and without significant changes to the standard execution pipeline is challenging for several reasons. First, the JobClient cannot simply create input splits based only on the default block size, as each HDFS block replica has a different size (because of indexes). Second, the JobTracker can no longer schedule map tasks based on data-locality and node availability only. The JobTracker now has to consider the existing indexes for each HDFS block. Third, the RecordReader has to perform either an index access or a full scan of HDFS blocks without any interaction with users, e.g. depending on the availability of suitable indexes. Fourth, the HDFS client can no longer open an input stream to a given HDFS block based on data-locality and node availability only: it has to consider index locality and availability as well. HAIL overcomes these issues mainly by providing two UDFs: the HailInputFormat and the HailRecordReader. Notice that by using UDFs we make HAIL easy to integrate into newer versions of Hadoop MapReduce. We discuss these two UDFs in the following.

4.3 HailInputFormat and HailRecordReader

HAILInputFormat implements a different splitting strategy than standard InputFormats. This strategy allows HAIL to reduce the number of map waves per job, i.e., the maximum number of map tasks per map slot required to complete this job. Thereby, the total scheduling overhead of MapReduce jobs is drastically reduced. We discuss the details of the HAIL Splitting strategy in Section 7.

HAILRecordReader is responsible for retrieving the records that satisfy the selection predicate of MapReduce jobs (as illustrated in the MapReduce Pipeline of Figure 2). Those records are then passed to the map function. For example, in Bob's query of Section 4.1, we need to find all records whose visitDate lies in the queried date range. To do so, for each data block required by the job, we first try to open an input stream to a block replica having the required index. For this, HAIL instructs the local HDFS Client to use the newly introduced getHostsWithIndex() method of each BlockLocation so as to choose the closest datanode with the desired index. Let us first focus on the case where a suitable, statically created index is available so that HAIL can open an input stream to an indexed replica.
Once that input stream has been opened, we use the information about selection predicates and attribute projections from the HailQuery annotation or from the job configuration file. When performing an index-scan, we read the index entirely into main memory (typically a few KB) to perform an index lookup. This also implies reading the qualifying block parts from disk into main memory and post-filtering records (see Section 3.4). Then, we reconstruct the projected attributes of qualifying tuples from PAX to row layout. In case no projection was specified by users, we reconstruct all attributes. Finally, we make a call to the map function for each qualifying tuple. For bad records (see Section 3.1), HAIL passes them directly to the map function, which in turn has to deal with them (just like in standard Hadoop MapReduce). For this, HAIL passes a record to the map function with a flag that indicates whether it is a bad record.

4.4 Problem: Missing Static Indexes

Finally, let us now discuss the second case, when Bob submits a job which filters on an unindexed attribute (e.g. on duration). Here, the HailRecordReader must completely scan the required attributes of unindexed blocks, apply the selection predicate, and perform tuple reconstruction. Notice that with static indexing, there is no way for HAIL to overcome the problem of missing indexes efficiently. This means that when the attributes used in the selection predicates of the workload change over time, the only way to adapt the set of available indexes is to upload the data again. However, this has the significant overhead of an additional upload, which goes against the principle of zero-overhead indexing. Thus, HAIL introduces an adaptive indexing technique that offers a much more elegant and efficient solution to this problem. We discuss this technique in the following section.

5 HAIL Zero-Overhead Adaptive Indexing

We now discuss the adaptive indexing pipeline of HAIL. The core idea is to create missing but promising indexes as byproducts of full scans in the map phase of MapReduce jobs. Similar to the static indexing pipeline, our goal is again to come closer towards zero-overhead indexing.
Therefore, we adopt two important principles from our static indexing pipeline. First, we again piggyback on a procedure that is naturally reading data from disk to main memory. This allows HAIL to completely save the data read cost for adaptive index creation. Second, as map tasks are usually I/O-bound, HAIL again exploits unused CPU time when computing clustered indexes in parallel to job execution.

In Section 5.1, we start with a general overview of the HAIL adaptive indexing pipeline. In Section 5.2, we focus on the internal components for building and storing clustered indexes incrementally. In Section 5.3, we present how HAIL accesses the indexes created at job runtime in a way that is transparent to the MapReduce job execution pipeline. Finally, in Section 6, we introduce three additional adaptive indexing techniques that make the indexing overhead over MapReduce jobs almost invisible to users.

5.1 HAIL Adaptive Indexing in the Execution Pipeline

For our motivating example, let us assume Bob continues to analyze his logs and notices some suspicious activities, e.g. many user visits with very short duration, indicating spam bot activities. Therefore, Bob suddenly needs different jobs for his analysis that select user visits with short durations. However, recall that unfortunately he did not create a static index on attribute duration at upload time, which would help for these new jobs. In general, as soon as Bob (or one of his colleagues) sends a new job (say job_d) with a selection predicate on an unindexed attribute (e.g. on attribute duration, which we will denote as d in the following), HAIL cannot benefit from index scans anymore. However, HAIL takes these jobs as hints on how to adaptively improve the repertoire of indexes for future jobs. HAIL piggybacks the creation of a clustered index over attribute duration on the execution of job_d. Without any loss of generality, we assume that job_d projects all attributes from its input dataset.

Figure 3 illustrates the general workflow of the HAIL adaptive indexing pipeline. The figure shows in more detail how HAIL processes map tasks of job_d when no suitable index is available (i.e., when performing a full scan). As soon as HAIL schedules a map task to a specific TaskTracker⁶, e.g. TaskTracker 5, the HAILRecordReader of the map task first reads the metadata from the HAILInputSplit ①⁷. With this metadata, the HAILRecordReader checks whether a suitable index is available for its input data block (say block_42). As no index on attribute d is available, the HAILRecordReader simply opens an input stream to the local replica of block_42 stored on DataNode 5. Then, the HAILRecordReader: (i) loads all values of the attributes required by job_d from disk to main memory ②; (ii) reconstructs records (as our HDFS blocks are in columnar layout); and (iii) feeds the map function with each record ③. Here lies the beauty of HAIL: an HDFS block that is a potential candidate for indexing was completely transferred to main memory as part of the job execution process.

Fig. 3 HAIL adaptive indexing pipeline.
In addition to feeding the entire block_42 to the map function, HAIL can create a clustered index on attribute d to speed up future jobs. For this, the HAILRecordReader passes block_42 to the AdaptiveIndexer as soon as the map function has finished processing this data block ④⁸. The AdaptiveIndexer, in turn, sorts the data in block_42 according to attribute d, aligns other attributes through reordering, and creates a sparse clustered index ⑤. Finally, the AdaptiveIndexer stores this index with a copy of block_42 (sorted on attribute d) as a pseudo data block replica ⑥. Additionally, the AdaptiveIndexer registers the newly created index for block_42 with the HDFS NameNode ⑦. In fact, the implementation of the adaptive indexing pipeline solves some interesting technical challenges. We discuss the pipeline in more detail in the remainder of this section.

⁶ A Hadoop instance responsible for executing map and reduce tasks.
⁷ That was obtained from the HAILInputFormat via getSplits().
⁸ Notice that all map tasks (even from different MapReduce jobs) running on the same node interact with the same AdaptiveIndexer instance. Hence, the AdaptiveIndexer can end up indexing data blocks from different MapReduce jobs at the same time.

5.2 AdaptiveIndexer

Adaptive indexing is an automatic process that is not explicitly requested by users and therefore should not unexpectedly impose significant performance penalties on users' jobs. Piggybacking adaptive indexing on map tasks allows us to completely save the read I/O-cost. However, the indexing effort is shifted to query time. As a result, any additional time involved in indexing will potentially add to the total runtime of MapReduce jobs. Therefore, the first concern of HAIL is: how to make adaptive index creation efficient?

To overcome this issue, the idea of HAIL is to run the mapping and indexing processes in parallel. However, interleaving map task execution with indexing bears the risk of race conditions between map tasks and the AdaptiveIndexer on the data block. In other words, the AdaptiveIndexer might potentially reorder data inside a data block while the map task is still concurrently reading the data block. One might think about copying data blocks before indexing to deal with this issue. Nevertheless, this would entail the additional runtime and memory overhead of copying such memory chunks. For this reason, HAIL does not interleave the mapping and indexing processes on the same data block. Instead, HAIL interleaves the indexing of a given data block (e.g. block_42) with the mapping phase of the succeeding data block (e.g. block_43), i.e., HAIL keeps two HDFS blocks in memory at the same time. For this, HAIL uses a producer-consumer pattern: a map task acts as producer by offering a data block to the AdaptiveIndexer, via a bounded blocking queue, as soon as it finishes processing the data block; in turn, the AdaptiveIndexer is constantly consuming data blocks from this queue. As a result, HAIL can perfectly interleave map tasks with indexing, except for the first and last data block to process in each node. It is worth noting that the queue exposed by the AdaptiveIndexer is allowed to reject data blocks in case a certain limit of enqueued data blocks is exceeded. This prevents the AdaptiveIndexer from running out of memory because of overload. Still, future MapReduce jobs with a selection predicate on the same attribute (i.e., on attribute d) can in turn take care of indexing the rejected data blocks. Once the AdaptiveIndexer pulls a data block from its queue, it processes the data block using two internal components: the IndexBuilder and the IndexWriter. Figure 4 illustrates the pipeline of these two internal components, which we discuss in the following.

Fig. 4 AdaptiveIndexer internals.

The IndexBuilder is a daemon thread that is responsible for creating sparse clustered indexes on data blocks in the data queue. With this aim, the IndexBuilder is constantly pulling one data block after another from the data block queue ①. Then, for each data block, the IndexBuilder starts with sorting the attribute column to index (attribute d in our example) ②. Additionally, the IndexBuilder builds a mapping {old position → new position} for all values as a permutation vector. After that, the IndexBuilder uses the permutation vector to reorder all other attributes in the offered data block ③. Once the IndexBuilder finishes sorting the entire data block on attribute d, it builds a sparse clustered index on attribute d ④. Then, the IndexBuilder passes the newly indexed data block to the IndexWriter ⑤. The IndexBuilder also communicates with the IndexWriter via a blocking queue. This allows HAIL to parallelise indexing with the I/O process for storing newly indexed data blocks.

The IndexWriter is another daemon thread and is responsible for persisting indexes created by the IndexBuilder to disk. The IndexWriter continuously pulls newly indexed data blocks from its queue in order to persist them on HDFS ⑥. Once the IndexWriter pulls a newly indexed data block (say block_42), it creates the block metadata and index metadata for block_42 ⑦. Notice that a newly indexed data block is just another replica of the logical data block, but with a different sort order.
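The IndexBuilder steps (sort the key column, derive a permutation vector, realign the other columns, build a sparse index) can be sketched as follows. This is a minimal illustration and not HAIL's implementation: the class and method names are hypothetical, columns are plain `int` arrays, the sparse index simply keeps the first key of every fixed-size partition (the partition size is an assumed parameter), and the permutation is stored as new position to old position, i.e. the inverse of the mapping named in the text.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of sorting one column and realigning the others.
class IndexBuilderSketch {

    /** Returns perm with perm[newPos] = oldPos, ordering rows by the key column. */
    static Integer[] sortPermutation(final int[] key) {
        Integer[] perm = new Integer[key.length];
        for (int i = 0; i < perm.length; i++) {
            perm[i] = i;
        }
        // Sorting an object array is stable, so equal keys keep their old order.
        Arrays.sort(perm, Comparator.comparingInt(i -> key[i]));
        return perm;
    }

    /** Applies the permutation vector to one column. */
    static int[] reorder(int[] column, Integer[] perm) {
        int[] out = new int[column.length];
        for (int newPos = 0; newPos < perm.length; newPos++) {
            out[newPos] = column[perm[newPos]];
        }
        return out;
    }

    /** Sparse index: the smallest key of each partition of partitionSize rows. */
    static int[] sparseIndex(int[] sortedKey, int partitionSize) {
        if (partitionSize <= 0) {
            throw new IllegalArgumentException("partitionSize must be positive");
        }
        int partitions = (sortedKey.length + partitionSize - 1) / partitionSize;
        int[] index = new int[partitions];
        for (int p = 0; p < partitions; p++) {
            index[p] = sortedKey[p * partitionSize];
        }
        return index;
    }

    public static void main(String[] args) {
        int[] d = {30, 10, 20, 10}; // column to index
        int[] a = {1, 2, 3, 4};     // another column of the same block
        Integer[] perm = sortPermutation(d);
        int[] sortedD = reorder(d, perm);
        System.out.println(Arrays.toString(sortedD));
        System.out.println(Arrays.toString(reorder(a, perm)));
        System.out.println(Arrays.toString(sparseIndex(sortedD, 2)));
    }
}
```

Keeping the permutation separate from the data means every additional column is realigned in a single linear pass, which is the property the reordering step relies on.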
In our example of Section 5.1, for instance, creating an index on attribute d for block_42 leads to having four data block replicas for block_42: one replica for each of the first four attributes. The IndexWriter creates a pseudo data block replica ⑧ and registers the new index with the NameNode ⑨. This allows HAIL to consider the newly created indexes in future jobs. In the following we discuss pseudo data block replicas in more detail.

5.3 Pseudo Data Block Replicas

The IndexWriter could simply write a new indexed data block as another replica. However, HDFS supports data block replication only at the file level, i.e., HDFS replicates all the data blocks of a given dataset the same number of times. This goes against the incremental nature of HAIL. A pseudo data block replica is basically a logical copy of a data block and allows HAIL to keep a different replication factor on a block basis rather than on a file basis. Therefore, we store each pseudo data block replica in a new HDFS file with replication factor one. Hence, the NameNode does not recognise it as a normal data block replica and instead simply sees the pseudo data block replica as another index available for the HDFS block. To avoid shipping across nodes, each IndexWriter aims at storing the pseudo data block replicas locally. The created HDFS files follow a naming convention, which includes the block id and the index attribute, to uniquely identify a pseudo data block replica. As pseudo data block replicas are stored in different HDFS files than normal data block replicas, three important questions arise:

How to access pseudo data block replicas in an invisible way for users? HAIL achieves this transparency via the HAILRecordReader. Users continue annotating their map functions (with selection predicates and projections). Then, the HAILRecordReader takes care of automatically switching from normal to pseudo data block replicas. For this, the HAILRecordReader uses the HAILInputStream, a wrapper of the Hadoop FSInputStream.

How to manage and limit the storage space consumed by the pseudo data block replicas? This question is related to optimization problems from physical database design, i.e. index selection.
Given a certain storage budget, the question is which indexes for an HDFS block to drop in order to achieve the highest workload benefit without exceeding the storage constraint. Solving this problem is beyond the scope of this article and is subject to ongoing work. A simple implementation could borrow ideas from buffer replacement strategies to attack the problem, e.g. LRU or replacing the least beneficial indexes.

How does the number of relatively small files created for pseudo data block replicas impact HDFS performance? The metadata storage overhead for each file entry with one associated block in the NameNode is about 150 bytes. This means that given 6GB of free heap space on the NameNode and an HDFS block size of 256MB, HAIL can support more than 10PB of data in pseudo block replicas. Additionally, future Hadoop versions will support federation of NameNodes to increase capacity, availability, and load balancing. This would alleviate the mentioned problem even further. Furthermore, the sequential read performance of a file that is stored in pseudo data block replicas matches the performance of normal HDFS files. This is because the number of seeks and DataNode hops involved in switching between pseudo data block replicas is comparable to reading over block boundaries when scanning normal HDFS files.
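The capacity estimate above can be checked with a short calculation. The sketch assumes decimal units (1 GB = 10^9 bytes) and exactly one pseudo replica file per 256MB block; both are simplifications made here, not statements from the text.

```latex
\[
\frac{6 \times 10^{9}\ \text{B}}{150\ \text{B per file}} = 4 \times 10^{7}\ \text{files},
\qquad
4 \times 10^{7} \times 256\ \text{MB} = 1.024 \times 10^{10}\ \text{MB} = 10.24\ \text{PB}.
\]
```

Under these assumptions the NameNode heap, rather than the block size, is the limiting factor, which is consistent with the claim of more than 10PB.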
Fig. 5 HAILRecordReader internals.

5.4 HAIL RecordReader Internals

Figure 5 illustrates the internal pipeline of the HAILRecordReader when processing a given HAILInputSplit. When a map task starts, the HAILRecordReader first reads the metadata of its HAILInputSplit in order to check if there exists a suitable index to process the input data block (block_42) ①. If a suitable index is available, the HAILRecordReader initialises the HAILInputStream with the selection predicate of job_d as a parameter ②. Internally, the HAILInputStream checks if the index resides in a normal or pseudo data block replica ③. This allows the HAILInputStream to open an input stream to the right HDFS file. This is because normal and pseudo data block replicas are stored in different HDFS files. While all normal data block replicas belong to the same HDFS file, each pseudo data block replica belongs to a different HDFS file ④. In our example the index on attribute d for block_42 resides in a pseudo data block replica. Therefore, the HAILInputStream opens an input stream to the HDFS file /pseudo/blk_42/d ⑤. As a result, the HAILRecordReader does not care from which file it is reading, since normal and pseudo data block replicas have the same format. Therefore, switching between a normal and a pseudo data block replica is not only invisible to users, but also to the HAILRecordReader. The HAILRecordReader just reads the block and index metadata using the HAILInputStream ⑥.
After performing an index lookup for the selection predicate of job_d, the HAILRecordReader loads only the projected attributes (a, b, c, and d) from the qualifying tuples (e.g. tuples with rowIDs in …) ⑦. Finally, the HAILRecordReader forms key/value-pairs and passes only qualifying pairs to the map function ⑧. In case no suitable index exists, the HAILRecordReader takes the Hadoop InputStream, which opens an input stream to any normal data block replica, and falls back to a full scan (like standard Hadoop MapReduce).

6 Adaptive Indexing Strategies

In the previous section we discussed the core principles of the HAIL adaptive indexing pipeline. Now, we introduce three strategies that allow HAIL to improve the performance of MapReduce jobs. We first present lazy adaptive indexing and eager adaptive indexing, two techniques that allow HAIL to control its incremental indexing mechanism with respect to runtime overhead and convergence rate. We then discuss how HAIL can prioritise data blocks for indexing based on their selectivity. Finally, we introduce selectivity-based indexing, a technique to decide which blocks to offer to the adaptive indexer based on job selectivity.

6.1 Lazy Adaptive Indexing

The blocking queues used by the AdaptiveIndexer allow us to easily protect HAIL against CPU overloading. However, writing pseudo data block replicas can also slow down the parallel read and write processes of MapReduce jobs. In fact, the negative impact of extra I/O operations can be high, as MapReduce jobs are typically I/O-bound. As a result, HAIL as a whole might become slower even if the AdaptiveIndexer can computationally keep up with the job execution. So, the question that arises is: how to write pseudo data block replicas efficiently?

HAIL solves this problem by making indexing incremental, i.e., HAIL spreads index creation over multiple MapReduce jobs. The goal is to balance index creation cost over multiple MapReduce jobs so that users perceive small (or no) overhead in their jobs. To do so, HAIL uses an offer rate, which is a ratio that limits the maximum number of pseudo data block replicas (i.e., number of data blocks to index) to create during a single MapReduce job.
For exmple, using n offer rte of 1%, HAIL indexes in single MpRedue jo t mximum one dt lok out of ten proessed dt loks (i.e., HAIL only indexes 1% of the totl dt loks). Notie tht, onseutive dptive indexing jos with seletions on the sme ttriute lredy enefit from pseudo dt lok replis reted during previous jos. This strtegy hs two mjor dvntges. First, HAIL n redue the dditionl I/O introdued y indexing to level tht is eptle for the user. Seond, the indexing effort done y HAIL for ertin ttriute is proportionl to the numer of times seletion is performed on tht ttriute. Another dvntge of using n offer rte is tht users n deide how fst they wnt to onverge to omplete index, i.e., ll dt loks re indexed. For instne, using n offer rte of 1%, HAIL would require 1 MpRedue jos with seletion predite on the sme ttriute to onverge to omplete index (i.e. until ll HDFS loks re fully indexed). Like tht, on the one hnd, the investment in terms of time nd spe for MpRedue jos with seletion pred-
14 14 Stefn Rihter et l. ites on unfrequent ttriutes is minimized. On the other hnd, MpRedue jos with seletion predites on frequent ttriutes quikly onverge to ompletely indexed opy. 6.2 Eger Adptive Indexing Lzy dptive indexing llows HAIL to esily throttle down dptive indexing efforts to n eptle (or even invisile) degree for users (see Setion 6.1). However, let us mke two importnt oservtions tht ould mke onstnt offer rte not desirle for ertin users: (1.) Using onstnt offer rte, the jo runtime of onseutive MpRedue jos hving filter ondition on the sme ttriute is not onstnt. Insted, they hve n lmost linerly deresing runtime up to the point where ll loks re indexed. This is euse the first MpRedue jo is the only to perform full sn over ll the dt loks of given dtset. Conseutive jos, even when indexing nd storing the sme mount of loks, re likely to run fster s they enefit from ll indexing work of their predeessors. (2.) HAIL tully delys indexing y using n offer rte. The trdeoff here is tht using lower offer rte leds to lower indexing overhed, ut it requires more MpRedue jos to index ll the dt loks in given dtset. However, some users might wnt to limit the experiened indexing overhed nd still desire to enefit from omplete indexing s soon s possile. Therefore, we propose n eger dptive indexing strtegy to del with this prolem. The si ide of eger dptive indexing is to dynmilly dpt the offer rte for MpRedue jos ording to the indexing work hieved y previous jos. In other words, eger dptive indexing tries to exploit the sved runtime nd reinvest it s muh s possile into further indexing. To do so, HAIL first needs to estimte the runtime gin (in given MpRedue jo) from performing n index sn on the lredy reted pseudo dt lok replis. For this, HAIL uses ost model to estimte the totl runtime, T jo, of given MpRedue jo (Eqution 1). Tle 1 lists the prmeters we use in the ost model. T jo = T is + t f sw n f sw + T idxoverhed. 
(1)

We define the number of map waves performing a full scan, $n_{fsw}$, as $\lceil (n_{blocks} - n_{idxBlocks}) / n_{slots} \rceil$. Intuitively, the total runtime $T_{job}$ of a job consists of three parts. First, the time required by HAIL to process the existing pseudo data block replicas, i.e., all data blocks having a relevant index, $T_{is}$. Second, the time required by HAIL to process the data blocks without a relevant index, $t_{fsw} \cdot n_{fsw}$. Third, the time overhead caused by adaptive indexing, $T_{idxOverhead}$ (see footnote 9). This overhead depends on the number of data blocks that are offered to the AdaptiveIndexer and the average time overhead observed for indexing a block. Formally, we define $T_{idxOverhead}$ as follows:

$$T_{idxOverhead} = t_{idxOverhead} \cdot \min\left(\rho \cdot \left\lceil \frac{n_{blocks}}{n_{slots}} \right\rceil,\ n_{fsw}\right). \tag{2}$$

Table 1 Cost model parameters.

Notation              Description
$n_{slots}$           The number of map tasks that can run in parallel in a given Hadoop cluster
$n_{blocks}$          The number of data blocks of a given dataset
$n_{idxBlocks}$       The number of blocks with a relevant index
$n_{fsw}$             The number of map waves performing a full scan
$t_{fsw}$             The average runtime of a map wave performing a full scan (without adaptive indexing overhead)
$t_{idxOverhead}$     The average time overhead of adaptive indexing in a map wave
$T_{idxOverhead}$     The total time overhead of adaptive indexing
$T_{is}$              The total runtime of the map waves performing an index scan
$T_{job}$             The total runtime of a given job
$T_{target}$          The targeted total job runtime
$\rho$                The ratio of data blocks (w.r.t. $n_{blocks}$) offered to the AdaptiveIndexer

We can use this model to automatically calculate the offer rate $\rho$ in order to keep the adaptive indexing overhead acceptable for users. Formally, from Equations 1 and 2, we derive $\rho$ as follows:

$$\rho = \frac{T_{target} - T_{is} - t_{fsw} \cdot n_{fsw}}{t_{idxOverhead} \cdot \left\lceil \frac{n_{blocks}}{n_{slots}} \right\rceil}.$$

Therefore, given a target job runtime $T_{target}$, HAIL can automatically set $\rho$ in order to fully spend its time budget for creating indexes and use the gained runtime in the next jobs either to speed up the jobs or to create even more indexes.

Footnote 9: It is worth noting that $T_{idxOverhead}$ denotes only the additional runtime that a MapReduce job has due to adaptive indexing.
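As a rough illustration of the cost model, the following Python sketch evaluates Equations 1 and 2 and solves them for the offer rate ρ. The function names, the example numbers, and the clamping of ρ to the share of waves that still require a full scan are assumptions made for this sketch; they are not taken from the HAIL implementation.

```python
import math


def full_scan_waves(n_blocks, n_idx_blocks, n_slots):
    """n_fsw: number of map waves that still need a full scan."""
    return math.ceil((n_blocks - n_idx_blocks) / n_slots)


def job_runtime(t_is, t_fsw, t_idx_overhead, rho, n_blocks, n_idx_blocks, n_slots):
    """Equation 1, with T_idxOverhead expanded as in Equation 2."""
    n_fsw = full_scan_waves(n_blocks, n_idx_blocks, n_slots)
    total_waves = math.ceil(n_blocks / n_slots)
    overhead = t_idx_overhead * min(rho * total_waves, n_fsw)
    return t_is + t_fsw * n_fsw + overhead


def offer_rate(t_target, t_is, t_fsw, t_idx_overhead, n_blocks, n_idx_blocks, n_slots):
    """Solve Equations 1 and 2 for rho, given a target job runtime."""
    n_fsw = full_scan_waves(n_blocks, n_idx_blocks, n_slots)
    total_waves = math.ceil(n_blocks / n_slots)
    budget = t_target - t_is - t_fsw * n_fsw
    if budget <= 0 or n_fsw == 0:
        return 0.0  # no time budget left, or nothing left to index
    rho = budget / (t_idx_overhead * total_waves)
    # Assumption of this sketch: never offer more than what is still unindexed.
    return min(rho, n_fsw / total_waves, 1.0)


if __name__ == "__main__":
    # Hypothetical numbers: 1000 blocks, 100 map slots, nothing indexed yet.
    rho = offer_rate(t_target=104.0, t_is=0.0, t_fsw=10.0, t_idx_overhead=2.0,
                     n_blocks=1000, n_idx_blocks=0, n_slots=100)
    print("offer rate:", rho)
    print("job runtime:", job_runtime(0.0, 10.0, 2.0, rho, 1000, 0, 100))
```

Plugging the computed ρ back into `job_runtime` reproduces the target runtime whenever the clamp does not apply, which is a quick consistency check between the two equations.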
Usually, we choose $T_{target}$ to be equal to the runtime of the very first job so that users observe a stable runtime until almost everything is indexed. However, users can set $T_{target}$ to any time budget in order to adapt the indexing effort to their needs. Notice that, since already indexed pseudo data block replicas are not offered again to the AdaptiveIndexer, HAIL first processes pseudo data block replicas and measures $T_{is}$ before deciding what offer rate to use for the unindexed blocks. The times $t_{fsw}$ (from Equation 1) and $t_{idxOverhead}$ (from Equation 2) can be measured in a calibration job or given by users.

On the one hand, HAIL can now adapt the offer rates to the performance gains obtained from performing index scans over the already indexed data blocks. On the other hand, by gradually increasing the offer rate, eager adaptive indexing prioritises complete index convergence over early runtime improvements for users. Thus, users no longer experience an incremental and linear speed-up in job performance until the index is eventually complete, but instead they experience a sharp improvement when HAIL approaches a complete index. In summary, besides limiting the overhead of adaptive indexing, the offer rate can also be considered as a tuning knob to trade early runtime improvements against faster indexing.
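To make the difference between a constant (lazy) offer rate and the eager strategy more concrete, the following Python sketch simulates a sequence of jobs filtering on the same attribute. It is a toy model under stated assumptions: the per-wave index scan time `t_isw`, the example parameters, and the rule that the eager rate never drops below the initial rate `rho0` are hypothetical choices for this illustration, not part of HAIL.

```python
import math


def simulate(n_blocks, n_slots, t_fsw, t_idx, t_isw, rho0, eager):
    """Return a list of (offer rate, job runtime) pairs until all blocks are indexed."""
    assert rho0 > 0, "a positive initial offer rate is needed to make progress"
    total_waves = math.ceil(n_blocks / n_slots)
    n_idx, t_target, history = 0, None, []
    while n_idx < n_blocks:
        n_fsw = math.ceil((n_blocks - n_idx) / n_slots)
        # Assumption: index scans cost t_isw per wave of already indexed blocks.
        t_is = t_isw * math.ceil(n_idx / n_slots)
        if eager and t_target is not None:
            # Reinvest the runtime saved by index scans into more indexing.
            budget = t_target - t_is - t_fsw * n_fsw
            rho = max(budget, 0.0) / (t_idx * total_waves)
            rho = max(rho0, min(rho, 1.0))
        else:
            rho = rho0
        t_job = t_is + t_fsw * n_fsw + t_idx * min(rho * total_waves, n_fsw)
        if t_target is None:
            t_target = t_job  # the first job defines the target runtime
        n_idx = min(n_blocks, n_idx + round(rho * n_blocks))
        history.append((rho, t_job))
    return history


if __name__ == "__main__":
    params = dict(n_blocks=1000, n_slots=100, t_fsw=10.0, t_idx=2.0, t_isw=1.0, rho0=0.1)
    for name, eager in (("lazy", False), ("eager", True)):
        runs = simulate(eager=eager, **params)
        print(name, "jobs until complete index:", len(runs))
        for rho, t_job in runs:
            print("  offer rate %.2f, runtime %.1f" % (rho, t_job))
```

In this toy setting the lazy variant shows a steadily decreasing runtime over many jobs, whereas the eager variant keeps the runtime near the target and completes the index in fewer jobs.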
6.3 Selectivity-based Adaptive Indexing

Earlier, we saw that HAIL uses an offer rate to limit the number of data blocks to index in a single MapReduce job. For this, HAIL uses a round-robin policy to select the data blocks to pass to the AdaptiveIndexer. This sounds reasonable under the assumption that data is uniformly distributed. However, datasets are typically skewed in practice and hence some data blocks might contain more qualifying tuples than others under a given query workload. Consequently, indexing highly selective data blocks before other data blocks promises higher performance benefits.

Therefore, HAIL can also use a selectivity-based data block selection approach for deciding which data blocks to use. The overall goal is to optimize the use of available computing resources. In order to maximize the expected performance improvement for future MapReduce jobs running on partially indexed datasets, we prioritize HDFS blocks with higher selectivity. The big advantage of this approach is that users can perceive higher improvements in performance for their MapReduce jobs from the very first runs. Additionally, as a side-effect of using this approach, HAIL can adapt faster to the selection predicates of MapReduce jobs.

However, how can HAIL efficiently obtain the selectivities of data blocks? For this, HAIL exploits the natural process of map tasks to propose data blocks to the AdaptiveIndexer. Recall that a map task passes a data block to the AdaptiveIndexer once the map task finished processing the block. Thus, HAIL can obtain the accurate selectivity of a data block by piggybacking on the map phase: when the data block is filtered according to the provided selection predicate. This allows HAIL to have perfect knowledge about selectivities for free. Given the selectivity of a data block, HAIL can decide if it is worth indexing the data block or not. In our current HAIL prototype, a map task proposes a data block to the AdaptiveIndexer if the percentage of qualifying tuples in the data block is at most 8%. However, users can adapt this threshold to their applications.
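The piggybacking idea can be sketched in a few lines of Python: the map phase filters a block anyway, so the selectivity falls out of the same pass, and a threshold decides whether the block is proposed for indexing. The function names and the in-memory list representation of a block are assumptions for this sketch; the threshold is left as a parameter since the text describes it as user-adjustable.

```python
def filter_block(tuples, predicate):
    """Apply the selection predicate and report the observed selectivity."""
    if not tuples:
        return [], 0.0
    qualifying = [t for t in tuples if predicate(t)]
    # Fraction of tuples in this block that satisfy the predicate.
    return qualifying, len(qualifying) / len(tuples)


def should_offer(selectivity, threshold):
    """Propose the block to the indexer only if few enough tuples qualify."""
    return selectivity <= threshold


if __name__ == "__main__":
    block = list(range(100))  # hypothetical block with 100 integer tuples
    result, selectivity = filter_block(block, lambda value: value < 5)
    print(len(result), selectivity, should_offer(selectivity, threshold=0.1))
```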
Notice that with the statistics on data block selectivities, HAIL can also decide which indexes to drop in case of storage limitations. However, a discussion on an index eviction strategy is out of the scope of this article.

7 HAIL Splitting and Scheduling

We now discuss how HAIL creates and schedules map tasks for any incoming MapReduce job. In contrast to the Hadoop MapReduce InputFormat, the HailInputFormat uses a more elaborate splitting policy, called HailSplitting. The overall idea of HailSplitting is to map one input split to several data blocks whenever a MapReduce job performs an index scan over its input. In the beginning, HailSplitting divides all input data blocks into two groups $B_i$ and $B_n$, where $B_i$ contains blocks that have at least one replica with a matching index (i.e., having a relevant replica) and $B_n$ contains blocks with no relevant replica. Then, the main goal of HailSplitting is to combine several data blocks from $B_i$ into one input split. For this, HailSplitting first partitions data blocks from $B_i$ according to the locations of their relevant replica in order to improve data locality. As a result of this process, HailSplitting produces as many partitions of blocks as there are datanodes storing at least one indexed block of the given input. Then, for each partition of data blocks, HailSplitting creates as many input splits as there exist map slots per TaskTracker. Thus, HAIL reduces the number of map tasks and hence reduces the aggregated costs of initializing and finalizing map tasks.

The reader might think that using several blocks per input split may significantly impact failover. However, this is not true since tasks performing an index scan are relatively short running. Therefore, the probability that one node fails in this period of time is very low [4]. Still, in case a node fails in this period of time, HAIL simply reschedules the failed map tasks, which results only in a few seconds of overhead anyway. Optionally, HAIL could apply the checkpointing techniques proposed in [4] in order to improve failover. We will study these interesting aspects in future work.
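The splitting policy described above can be sketched as follows in Python. This is a simplified illustration, not the HailInputFormat code: each block is assumed to carry at most one relevant datanode, blocks are dealt round-robin into the splits of a partition, and no empty splits are created when a partition has fewer blocks than map slots. All names are hypothetical.

```python
from collections import defaultdict


def hail_splits(blocks, slots_per_tracker):
    """blocks: list of (block_id, node) pairs, where node is the datanode
    holding a relevant indexed replica, or None if no such replica exists.
    Returns a list of (preferred_node, [block_ids]) input splits."""
    indexed_by_node = defaultdict(list)  # partitions of B_i, one per datanode
    unindexed = []                       # B_n
    for block_id, node in blocks:
        if node is None:
            unindexed.append(block_id)
        else:
            indexed_by_node[node].append(block_id)

    splits = []
    for node, block_ids in sorted(indexed_by_node.items()):
        # One split per map slot, but never more splits than blocks.
        n_splits = min(slots_per_tracker, len(block_ids))
        for i in range(n_splits):
            splits.append((node, block_ids[i::n_splits]))
    # Unindexed blocks keep the standard one-block-per-map-task behaviour.
    splits.extend((None, [block_id]) for block_id in unindexed)
    return splits


if __name__ == "__main__":
    blocks = [(0, "n1"), (1, "n1"), (2, "n1"), (3, "n2"), (4, None)]
    for split in hail_splits(blocks, slots_per_tracker=2):
        print(split)
```

With many indexed blocks per datanode, the number of splits is bounded by the number of datanodes times the number of map slots, which is the effect the text attributes to HailSplitting.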
The reader might also think that performance could be negatively impacted in case data locality is not achieved for several map tasks. However, the cost of fetching small parts of blocks through the network (which is the case when using an index scan) is negligible [34]. Moreover, one can significantly improve data locality by simply using an adequate scheduling policy (e.g., the Delay Scheduler [46]). If no relevant index exists, HAIL scheduling falls back to standard Hadoop scheduling by optimizing data locality only.

For all data blocks in $B_n$, HAIL creates one map task per unindexed data block just like standard Hadoop. Then, for each map task, HAIL considers r different computing nodes as possible locations to schedule a map task, where r is the replication factor of the input dataset. However, in contrast to original Hadoop, HAIL prefers to assign map tasks to those nodes that currently store fewer indexes than the average. Since HAIL stores pseudo data block replicas local to the map tasks that created them, this scheduling strategy results in a balanced index placement and allows HAIL to better parallelize index access for future MapReduce jobs.

8 Related Work

HAIL uses PAX [3] as data layout for HDFS blocks, i.e., a columnar layout inside the HDFS block. PAX was originally invented for cache-conscious processing, but it has been adapted in the context of MapReduce [12]. In our previous work [34], we showed how to improve over PAX by computing different layouts on the different replicas, but we did not consider indexing. This article fills this gap.
Static Indexing. Indexing is a crucial step in all major DBMSs [19,1,2,8,11]. The overall idea behind all these approaches is to analyze a query workload and to statically decide which attributes to index based on these observations. Several research works have focused on supporting index access in MapReduce workflows [45, 35, 15, 33]. However, all these offline approaches have three big disadvantages. First, they incur a high upfront indexing cost that several applications cannot afford (such as scientific applications). Second, they only create a single clustered index per dataset, which is not suitable for query workloads having selection predicates on different attributes. Third, they cannot adapt to changes in workloads without the intervention of a DBA.

Online Indexing. Tuning a database at upload time has become harder as query workloads become more dynamic and complex. Thus, different DBMSs started to use online tuning tools to attack the problem of dynamic workloads [42,6,7, 37]. The idea is to continuously monitor the performance of the system and create (or drop) indexes as soon as it is considered beneficial. Manimal [9,32] can be used as an online indexing approach for automatically optimizing MapReduce jobs. The idea of Manimal is to generate a MapReduce job for index creation as soon as an incoming MapReduce job has a selection predicate on an unindexed attribute. Online indexing can then adapt to query workloads. However, online indexing techniques require us to index a dataset completely in one pass. Therefore, online indexing techniques simply transfer the high cost of index creation from upload time to query processing time.

Adaptive Indexing. HAIL is inspired by database cracking [28], which aims at removing the high upfront cost barrier of index creation. The main idea of database cracking is to start organising a given attribute (i.e., to create an adaptive index on an attribute) when it receives for the first time a query with a selection predicate on that attribute. Thus, future incoming queries having predicates on the same attribute continue refining the adaptive index as long as a finer granularity of key ranges is advantageous.
Key ranges in an adaptive index are disjoint, where keys in each key range are unsorted. Basically, adaptive indexing performs for each query one step of quicksort using the selection predicates as pivot for partitioning attributes. HAIL differs from adaptive indexing in four aspects. First, HAIL creates a clustered index for each data block and hence avoids any data shuffling across data blocks. This allows HAIL to preserve Hadoop fault-tolerance. Second, HAIL considers disk-based systems and thus it factors in the cost of reorganising data inside data blocks. Third, HAIL parallelises the indexing effort across several computing nodes to minimise the indexing overhead. Fourth, HAIL focuses on creating clustered indexes instead of unclustered indexes. A follow-up work [29] focuses on lazily aligning attributes to converge into a clustered index after a certain number of queries. However, it considers a main memory system and hence does not factor in the I/O cost for moving data many times on disk. Other works on adaptive indexing in main memory databases have focused on updates [31], concurrency control [2], and robustness [25], but these works are orthogonal to the problem we address in this paper.

Adaptive Merging. Another work related to HAIL is adaptive merging [21]. This approach uses standard B-trees to persist intermediate results during an external sort. Then, it only merges those key ranges that are relevant to queries. In other words, adaptive merging incrementally performs external sort steps as a side effect of query processing. However, this approach cannot be applied directly for MapReduce workflows for three reasons. First, like adaptive indexing, this approach creates unclustered indexes. Second, merging data in MapReduce destroys Hadoop fault-tolerance and hurts the performance of MapReduce jobs. This is because adaptive merging would require us to merge data from several data blocks into one. Notice that merging data inside a data block would not make sense as a data block is typically loaded entirely into main memory by map tasks anyway. Third, it has an expensive initial step to create the first sorted runs.
A follow-up work uses adaptive indexing to reduce the cost of the initial step of adaptive merging in main memory [3]. However, it considers main memory systems and hence it has the first two problems.

Adaptive Loading. Some other works focus on loading data into a database in an incremental [1] or in a lazy [27] manner with the goal of reducing the upfront cost for parsing and storing data inside a database. These approaches allow for dramatically reducing the delay until users can execute their first queries. In the context of Hadoop, [1] proposes to load those parts of a dataset that were parsed as input to MapReduce jobs into a database at job runtime. Hence, consecutive MapReduce jobs that require the same data can benefit, e.g., from the binary representation or indexes inside the database store. However, this scenario already involves an additional roundtrip of first writing the data to HDFS, reading it from HDFS to then again store the data inside a database, plus some overhead for index creation. In contrast to these works, HAIL aims at reducing the upfront cost of data parsing and index creation already when loading data into HDFS. In other words, while these approaches aim at adaptively uploading raw datasets from HDFS into a database to improve performance, HAIL aims at indexing raw datasets directly in HDFS to improve performance, without additional read/write cycles.

NoDB, another recent work, proposes to run queries directly on raw datasets [4]. Additionally, this approach (i) remembers the offsets of individual attribute values, and (ii) caches binary values from the dataset, both of which are extracted as byproducts of query execution. Those optimizations allow for reducing the tokenizing and parsing costs for consecutive queries that touch previously processed parts of the dataset. However, NoDB considers a single node scenario using a local file system, while HAIL considers a distributed environment and a distributed file system. As shown in our experiments, writing to HDFS is I/O bound and parsing the attributes of a dataset entirely can be performed in parallel to storing the data in HDFS. Since data parsing does not cause noticeable runtime overhead in our scenario, incremental loading techniques as presented in [4] are not required for HAIL. Furthermore, NoDB does not consider different sort orders or indexes to improve data access.

To the best of our knowledge, this work is the first work that aims at pushing indexing to the extreme at low index creation cost and to propose an adaptive indexing solution suitable for MapReduce systems.

9 Experiments

Let's get back to Bob again and his initial question: will HAIL solve his indexing problem efficiently? To answer this question, we need to run a first wave of experiments in order to answer the following questions as well:

(1.) What is the performance of HAIL at upload time? What is the impact of static indexing in the upload pipeline? How many indexes can we create in the time the standard HDFS uploads the data? How does hardware performance affect HAIL upload? How well does HAIL scale out on large clusters? (We answer these questions in Section 9.3).

(2.) What is the performance of HAIL at query time? How much does HAIL benefit from statically created indexes? How does query selectivity affect HAIL? How do failing nodes affect performance? (We answer these questions in Section 9.4). How does HailSplitting improve end-to-end job runtimes? (We answer this question in Section 9.5).

But what happens if Bob did not create the right indexes upfront? How can Bob adapt his indexes to a new workload that he did not predict at upload time? For this, we need to evaluate the efficiency of HAIL to adapt to query workloads and compare it with Hadoop and a version of HAIL that only uses static indexing. We present a second wave of experiments to answer the following main questions:

(3.)
What is the overhead of running the adaptive indexing techniques in HAIL? How fast can HAIL adapt to changes in the query workload? How much can MapReduce jobs benefit from the adaptivity of HAIL? How well does each of the adaptive indexing techniques of HAIL allow MapReduce jobs to improve their runtime? (We answer these questions in Section 9.6)

9.1 Hardware and Systems

Hardware. We use six different clusters. One is a physical 10-node cluster. Each node has one 2.66GHz Quad Core Xeon processor running 64-bit platform Linux openSUSE 11.1 OS, 4x4GB of main memory, 6x75GB SATA HD, and three Gigabit network cards. Our physical cluster has the advantage that the amount of runtime variance is limited [41]. Yet, to fully understand the scale-up properties of HAIL, we use three different EC2 clusters, each having 10 nodes. For each of these three clusters, we use different node types (see Section 9.3.3). Finally, to understand how well HAIL scales out, we consider two more EC2 clusters: one with 50 nodes and one with 100 nodes (see Section 9.3.4).

Systems. We compared the following systems: (1) Hadoop, (2) Hadoop++ as described in [15], and (3) HAIL as described in this article. For HAIL, we disable the HAIL splitting in Section 9.4 in order to measure the benefits of using this policy in Section 9.5. All three systems are based on Hadoop 0.20.203 and are compiled and run using Java 7. All systems were configured to use the default HDFS block size of 64MB if not mentioned otherwise.

9.2 Datasets and Queries

Datasets. For our benchmarks we use two different datasets. First, we use the UserVisits table as described in [39]. This dataset nicely matches Bob's Use Case. We generated 2GB of UserVisits data per node using the data generator proposed by [39]. Second, we additionally use a Synthetic dataset consisting of 19 integer attributes in order to understand the effects of selectivity. Notice that this Synthetic dataset is similar to scientific datasets, where all or most of the attributes are integer/float attributes (e.g., the SDSS dataset). For this dataset, we generated 13GB per node.

Queries.
For the UserVisits dataset, we consider the following queries as Bob's workload:

Bob-Q1 (selectivity: $3.1 \times 10^{-2}$)
SELECT sourceIP FROM UserVisits WHERE visitDate BETWEEN AND

Bob-Q2 (selectivity: $3.2 \times 10^{-8}$)
SELECT searchWord, duration, adRevenue FROM UserVisits WHERE sourceIP=

Bob-Q3 (selectivity: $6 \times 10^{-9}$)
SELECT searchWord, duration, adRevenue FROM UserVisits WHERE sourceIP= AND visitDate=

Bob-Q4 (selectivity: $1.7 \times 10^{-2}$)
SELECT searchWord, duration, adRevenue FROM UserVisits WHERE adRevenue>=1 AND adRevenue<=1

Additionally, we use a variation of query Bob-Q4 to see how well HAIL performs on queries with low selectivities:

Bob-Q5 (selectivity: $2.4 \times 10^{-1}$)
SELECT searchWord, duration, adRevenue FROM UserVisits WHERE adRevenue>=1 AND adRevenue<=1
[Fig. 6 Upload times when varying the number of created indexes (a)&(b) and the number of data block replicas (c). Panels: (a) Upload time for UserVisits, (b) Upload time for Synthetic, (c) Varying replication for Synthetic. Axes: upload time [sec] versus number of created indexes (a, b) or number of created replicas (c), for Hadoop, Hadoop++, and HAIL; panel (c) marks the Hadoop upload time with 3 replicas (default).]

For the Synthetic dataset, we use the queries in Table 2. Notice that for Synthetic all queries use the same attribute for filtering. Hence, for this dataset HAIL cannot benefit from its different indexes: it creates three different indexes, yet only one of them will be used by these queries.

Table 2 Synthetic queries.
Columns: Query, #Projected Attributes, Selectivity
Syn-Q Syn-Q1 9.1 Syn-Q1 1.1 Syn-Q Syn-Q2 9.1 Syn-Q2 1.1

For all queries and experiments, we report the average runtime of three trials.

9.3 Data Loading

We strongly believe that upload time is a crucial aspect when adopting a parallel data-intensive system. This is because most users (such as Bob or scientists) want to start analyzing their data early. In fact, low startup costs are one of the big advantages of standard Hadoop over RDBMSs. Thus, we exhaustively study the upload performance of HAIL.

9.3.1 Varying the Number of Indexes

We first measure the impact in performance when creating indexes statically. For this, we scale the number of indexes to create when uploading the UserVisits and the Synthetic datasets. For HAIL, we vary the number of indexes from 0 to 3 and for Hadoop++ from 0 to 1 (this is because Hadoop++ cannot create more than one index). For Hadoop, we only report numbers with 0 indexes as it cannot create any index. Figure 6(a) shows the results for the UserVisits dataset. We observe that HAIL has a negligible upload overhead of 2% over standard Hadoop. Then, when HAIL creates one index per replica the overhead still remains very low (at most 14%). On the other hand, we observe that HAIL improves over Hadoop++ by a factor of 5.1 when creating no index and by a factor of 7.3 when creating one index.
This is because Hadoop++ has to run two expensive MapReduce jobs for creating one index. For HAIL, we observe that for two and three indexes the upload costs increase only slightly. Figure 6(b) illustrates the results for the Synthetic dataset. We observe that HAIL significantly outperforms Hadoop++ again by a factor of 5.2 when creating no index and by a factor of 8.2 when creating one index. On the other hand, we now observe that HAIL outperforms Hadoop by a factor of 1.6 even when creating three indexes. This is because the Synthetic dataset is well suited for binary representation, i.e., in contrast to the UserVisits dataset, HAIL can significantly reduce the initial dataset size. This allows HAIL to outperform Hadoop even when creating one, two, or three indexes.

For the remaining upload experiments, we discard Hadoop++ as we clearly saw in this section that it does not upload datasets efficiently. Therefore, we focus on HAIL using Hadoop as baseline.

9.3.2 Varying the Replication Factor

We now analyze how well HAIL performs when increasing the number of replicas. In particular, we aim at finding out how many indexes HAIL can create for a given dataset in the same time standard Hadoop needs to upload the same dataset with the default replication factor of three and creating no indexes. To do this, we upload the Synthetic dataset with different replication factors. In this experiment, HAIL creates as many clustered indexes as block replicas. In other words, when HAIL uploads the Synthetic dataset with a replication factor of five, it creates five different clustered indexes for each block.

Figure 6(c) shows the results for this experiment. The dotted line marks the time Hadoop takes to upload with the default replication factor of three. We see that HAIL significantly outperforms Hadoop for any replication factor and up to a factor of 2.5. More interestingly, we observe that HAIL stores six replicas (and hence it creates six different clustered indexes) in a little less than the same time Hadoop uploads the same dataset with only three replicas without creating any index.
Still, when increasing the replication factor even further for HAIL, we see that HAIL has only a minor overhead over Hadoop with three replicas only. These results also show that choosing the replication factor mainly depends on the available disk space. Even in this respect, HAIL improves over Hadoop. For example, while Hadoop needs 39GB to upload the Synthetic dataset with 3 block replicas, HAIL needs only 42GB to upload the same dataset with 6 block replicas! HAIL enables users to stress indexing to the extreme to speed up their query workloads.

[Table 3 Scale-up results. (a) Upload times for UserVisits when scaling-up [sec]; (b) Upload times for Synthetic when scaling-up [sec]. Columns: Cluster Node Type, Hadoop, HAIL, System Speedup. Rows: Large, Extra Large, Cluster Quadruple, Scale-Up Speedup, Physical. Cell values are missing in this transcription.]

9.3.3 Cluster Scale-Up

In this section, we study how different hardware affects HAIL upload times. For this, we create three 10-node EC2 clusters: the first uses large (m1.large) nodes (see footnote 10), the second extra large (m1.xlarge) nodes, and the third cluster quadruple (cc1.4xlarge) nodes. We upload the UserVisits and the Synthetic datasets on each of these clusters. We report the results of these experiments in Table 3(a) (for UserVisits) and in Table 3(b) (for Synthetic), where we display the System Speedup of HAIL over Hadoop as well as the Scale-Up Speedup for Hadoop and HAIL. Additionally, we show again the results for our local cluster as baseline.

As expected, we observe that both Hadoop and HAIL benefit from using better hardware. In addition, we also observe that HAIL always benefits from scaling-up computing nodes. Especially, using a better CPU makes parsing to binary faster. As a result, HAIL decreases (in Table 3(a)) or increases (Table 3(b)) the performance gap with respect to Hadoop when scaling-up (System Speedup). We see that Hadoop significantly improves its performance when scaling from Large (1844 s) to Extra Large (1296 s) instances. This is thanks to the better I/O subsystem of the Extra Large instance types. When scaling from Extra Large to Cluster Quadruple instances we see no real improvement, since the I/O subsystem stays the same and only the CPU power increases.
In contrast, HAIL benefits from additional and/or better CPU cores when scaling up. Finally, we observe that the system speedup of HAIL over Hadoop is even better when using physical nodes.

Footnote 10: For this cluster type, we allocate an additional large node to run the namenode and jobtracker.

9.3.4 Cluster Scale-Out

At this point, the reader might have already started wondering how well HAIL performs for larger clusters. To answer this question, we allocate one 50-node EC2 cluster and one 100-node EC2 cluster. We use cluster quadruple (cc1.4xlarge) nodes for both clusters, because with this node type we experienced the lowest performance variability. In both clusters, we allocated two additional nodes: one to serve as Namenode and the other to serve as JobTracker. While varying the number of nodes per cluster we keep the amount of data per node constant.

[Fig. 7 Scale-out results. Upload time [sec] versus number of nodes (10, 50, 100) for Hadoop and HAIL on the Synthetic (Syn) and UserVisits (UV) datasets.]

Figure 7 shows these results. We observe that HAIL achieves roughly the same upload times for the Synthetic dataset. For the UserVisits dataset, we see that HAIL improves its upload times for larger clusters. In particular, for 100 nodes, we see that HAIL matches the Hadoop upload times for the UserVisits dataset and outperforms Hadoop by a factor of up to 1.4 for the Synthetic dataset. More interestingly, we observe that, in contrast to Hadoop, HAIL does not suffer from high performance variability [41]. Overall, these results show the efficiency of HAIL when scaling-out.

9.4 MapReduce Job Execution

We now analyze the performance of HAIL when running MapReduce jobs. Our main goal for all these experiments is to understand how well HAIL can perform compared to the standard Hadoop MapReduce and Hadoop++ systems. With this in mind, we measure two different execution times. First, we measure the end-to-end job runtimes, which is the time a given job takes to run completely. Second, we measure the record reader runtimes, which is dominated by the time a given map task spends reading its input data.
Recall that for these experiments, we disable the HailSplitting policy (presented in Section 7) in order to better evaluate the benefits of having several clustered indexes per dataset. We study the benefits of HailSplitting in Section 9.5.
[Fig. 8 Job runtimes, record reader times, and Hadoop MapReduce framework overhead for Bob's query workload filtering on multiple attributes. Panels: (a) End-to-end job runtimes, (b) Average record reader runtimes, (c) Hadoop scheduling overhead. Systems: Hadoop, Hadoop++, HAIL; queries Bob-Q1 to Bob-Q5.]

[Fig. 9 Job runtimes, record reader times, and Hadoop scheduling overhead for the Synthetic query workload filtering on a single attribute. Panels: (a) End-to-end job runtimes, (b) Average record reader runtimes, (c) Hadoop scheduling overhead. Systems: Hadoop, Hadoop++, HAIL.]

9.4.1 Bob's Query Workload

For these experiments: Hadoop does not create any index; since Hadoop++ can only create a single clustered index, it creates one clustered index on sourceIP for all three replicas, as two very selective queries will benefit from this; HAIL creates one clustered index for each replica: one on visitDate, one on sourceIP, and one on adRevenue.

Figure 8(a) shows the average end-to-end runtimes for Bob's queries. We observe that HAIL outperforms both Hadoop and Hadoop++ in all queries. For Bob-Q2 and Bob-Q3, Hadoop++ has similar results as HAIL since both systems have an index on sourceIP. However, HAIL still outperforms Hadoop++. This is because HAIL does not have to read any block header to compute input splits while Hadoop++ does. Consequently, HAIL starts processing the input dataset earlier and hence it finishes earlier.

Figure 8(b) shows the RecordReader times (see footnote 11). Once more, we observe that HAIL outperforms both Hadoop and Hadoop++. HAIL is up to a factor of 46 faster than Hadoop and up to a factor of 38 faster than Hadoop++. This is because Hadoop++ is only competitive if it happens to hit the right index.
As HAIL has additional clustered indexes (one for each replica), the likelihood to hit an index increases. Then, query runtimes for Bob-Q1, Bob-Q4, and Bob-Q5 are sharply improved over Hadoop and Hadoop++.

Footnote 11: This is the time a map task takes to read and process its input.

Yet, if HAIL allows map tasks to read their input data by more than one order of magnitude faster than Hadoop and Hadoop++, why do MapReduce jobs not benefit from this? To understand this we estimate the overhead of the Hadoop MapReduce framework. We do this by considering an ideal execution time, i.e., the time needed to read all the required input data and execute the map functions over such data. We estimate the ideal execution time as

$$T_{ideal} = \frac{\#MapTasks}{\#ParallelMapTasks} \times \mathrm{Avg}(T_{RecordReader}).$$

Here #ParallelMapTasks is the maximum number of map tasks that can be performed at the same time by all computing nodes. We define the overhead as

$$T_{overhead} = T_{\text{end-to-end}} - T_{ideal}.$$

We show the results in Figure 8(c). We see that the Hadoop framework overhead is in fact dominating the total job runtime. This has many reasons. A major reason is that Hadoop was not built to execute very short tasks. To schedule a single task, Hadoop spends several seconds even though the actual task just runs in a few ms (as is the case for HAIL). Therefore, reducing the number of map tasks of a job could greatly decrease the end-to-end job runtime. We tackle this problem in Section 9.5.

9.4.2 Synthetic Query Workload

Our goal in this section is to study how query selectivities affect the performance of HAIL. Recall that for this experiment HAIL cannot benefit from its different indexes: all queries filter on the same attribute. We use this setup to isolate the effects of selectivity.

We present the end-to-end job runtimes in Figure 9(a) and the record reader times in Figure 9(b). We observe in Figure 9(a) that HAIL outperforms both Hadoop and Hadoop++. We see again that even if Hadoop++ has an index on the selected attribute, Hadoop++ runs slower than HAIL. This is because HAIL has a slightly different splitting phase than Hadoop++. Looking at the results in Figure 9(b),
the reader might think that HAIL is better than Hadoop++ because of the PAX layout used by HAIL. However, we clearly see in the results for query Syn-Q1 that this is not true (see footnote 12). We observe that even in this case HAIL is better than Hadoop++. The reason is that the index size in HAIL (2KB) is much smaller than the index size in Hadoop++ (34KB), which allows HAIL to read the index slightly faster. On the other hand, we see that Hadoop++ slightly outperforms HAIL for all three Syn-Q2 queries. This is because these queries are more selective and then the random I/O cost due to tuple reconstruction starts to dominate the record reader times.

Surprisingly, we observe that query selectivity does not affect end-to-end job runtimes (see Figure 9(a)) even if query selectivity has a clear impact on the RecordReader times (see Figure 9(b)). As explained in Section 9.4.1, this is due to the overhead of the Hadoop MapReduce framework. We clearly see this overhead in Figure 9(c). In Section 9.5, we will investigate this in more detail.

[Fig. 10 Fault-tolerance results. Job runtime [sec] and slowdown [%] for the systems Hadoop, HAIL, and HAIL-1Idx.]

9.4.3 Fault-Tolerance

In very large-scale clusters (especially on the Cloud), node failures are no more an exception but rather the rule. A big advantage of Hadoop MapReduce is that it can gracefully recover from these failures. Therefore, it is crucial to preserve this key property to reliably run MapReduce jobs with minimal performance impact under failures. In this section we study the effects of node failures in HAIL and compare it with standard Hadoop MapReduce.

We perform these experiments as follows: (i) we set the expiry interval to detect that a TaskTracker or a datanode failed to 3 seconds, (ii) we choose a node randomly and kill all Java processes on that node after 5% of work progress, and (iii) we measure the slowdown as in [15], $slowdown = \frac{T_f - T_b}{T_b} \cdot 100$, where $T_b$ is the job runtime without node failures and $T_f$ is the job runtime with a node failure. We use two configurations for HAIL.
First, we configure HAIL to create indexes on three different attributes, one for each replica. Second, we use a variant of HAIL, coined HAIL-1Idx, where we create an index on the same attribute for all three replicas. We do so to measure the performance impact of HAIL falling back to a full scan for some blocks after the node failure. This happens for any map task reading its input from the killed node. Notice that, in the case of HAIL-1Idx, all map tasks will still perform an index scan as all blocks have the same index.

Footnote 12: Recall that this query projects all attributes, which is indeed more beneficial for Hadoop++ as it uses a row layout.

Figure 10 shows the fault-tolerance results for Hadoop and HAIL. Overall, we observe that HAIL preserves the failover property of Hadoop by having almost the same slowdown. However, it is worth noting that HAIL can even improve over Hadoop when it creates the same index on all replicas (HAIL-1Idx). In this case, HAIL has a lower slowdown since failed map tasks can still perform an index scan even after the failure. As a result, HAIL runs almost as fast as when no failure occurs.

9.5 Impact of the HAIL Splitting Policy

We observed in Figures 8 and 9 that the Hadoop MapReduce framework incurs a high overhead in the end-to-end job runtimes. To evaluate how efficiently HAIL deals with this problem, we now enable the HailSplitting policy (described in Section 7) and run the Bob and Synthetic queries on HAIL again. Figure 11 illustrates these results. We clearly observe that HAIL significantly outperforms both Hadoop and Hadoop++. We see in Figure 11(a) that HAIL outperforms Hadoop by up to a factor of 68 and Hadoop++ by up to a factor of 73 for Bob's workload. This is mainly because the HailSplitting policy significantly reduces the number of map tasks from 3, 2 (which is the number of map tasks for Hadoop and Hadoop++) to only 2. As a result of the HailSplitting policy, the scheduling overhead does not impact the end-to-end workload runtimes in HAIL (see Section 9.4.1).
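The overhead estimate from Section 9.4.1 explains why fewer map tasks help: the ideal time depends only on the number of task waves and the average RecordReader time, and everything else in the end-to-end runtime is attributed to the framework. The following is a minimal sketch of that bookkeeping. The function names and all numbers are hypothetical illustrations, not measurements from the experiments.

```python
def ideal_time(num_map_tasks, parallel_map_tasks, avg_record_reader_time):
    """Ideal execution time: (#MapTasks / #ParallelMapTasks) * Avg(T_RecordReader)."""
    if parallel_map_tasks <= 0:
        raise ValueError("parallel_map_tasks must be positive")
    return num_map_tasks / parallel_map_tasks * avg_record_reader_time


def framework_overhead(end_to_end_time, num_map_tasks, parallel_map_tasks,
                       avg_record_reader_time):
    """Overhead: T_end-to-end minus the ideal execution time."""
    return end_to_end_time - ideal_time(
        num_map_tasks, parallel_map_tasks, avg_record_reader_time)


# Hypothetical job A: many very short map tasks (times in seconds).
overhead_many_tasks = framework_overhead(
    end_to_end_time=625.0, num_map_tasks=4000,
    parallel_map_tasks=40, avg_record_reader_time=0.25)

# Hypothetical job B: same ideal time, but only one wave of longer tasks.
overhead_few_tasks = framework_overhead(
    end_to_end_time=35.0, num_map_tasks=40,
    parallel_map_tasks=40, avg_record_reader_time=25.0)
```

Both hypothetical jobs have the same ideal time (25 seconds), so the difference in their end-to-end runtimes is attributed entirely to framework overhead under this definition.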
For the Synthetic workload (Figure 11(b)), we observe that HAIL outperforms Hadoop by up to a factor of 26 and Hadoop++ by up to a factor of 25. Overall, we observe in Figure 11(c) that using HAIL Bob can run all five of his queries 39x faster than Hadoop and 36x faster than Hadoop++. We also observe that HAIL runs all six Synthetic queries 9x faster than Hadoop and 8x faster than Hadoop++.

9.6 HAIL Adaptive Indexing

In the previous experiments we focused on the performance of HAIL with static indexing only, i.e., we deactivated HAIL adaptive indexing. For the following experiments we focus on the evaluation of the HAIL adaptive indexing pipeline. In addition to the 1-node cluster (Cluster-A) we used in previous experiments, we use an additional 4-node cluster (Cluster-B) in order to measure the influence of more efficient processors. In Cluster-B, each node has: one 3.46 GHz
Hexa Core Xeon X5690 processors; 2GB of main memory; one 278GB SATA hard disk (for the OS) and one 837GB SATA hard disk (for HDFS); two one-Gigabit network cards.

Fig. 11 End-to-end job runtimes for Bob and Synthetic queries using the HailSplitting policy: (a) Bob queries, (b) Synthetic queries, (c) total workload. Job runtime [sec] and total runtime [sec] for Hadoop, Hadoop++, and HAIL.

Since the results from previous experiments clearly showed the superiority of HAIL over Hadoop++, we decided to discard Hadoop++ and keep only Hadoop and HAIL with no adaptive indexing activated as baselines. For HAIL using the adaptive indexing techniques, we consider four different variants according to the offer rate ρ: HAIL (ρ = 0.1), HAIL (ρ = 0.25), HAIL (ρ = 0.5), and HAIL (ρ = 1). Notice that HAIL with no adaptive indexing is the same as HAIL (ρ = 0). Still, as in previous sections, we assume that HAIL creates one index on sourceIP, one on visitDate, and one on adRevenue for the UserVisits dataset. For the Synthetic dataset, we assume that HAIL does not create any index at upload time. Notice that, given the high Hadoop scheduling overhead we observed in previous experiments, we increase the data block size to 256MB to decrease such overhead for Hadoop. Moreover, making use of the lessons learned from the first wave of experiments, we slightly change our datasets and queries in order to stress and better evaluate HAIL under bigger datasets and different query selectivities. We describe these changes in the following.

Datasets. We again use the web log dataset (UserVisits) but scaled it to 4GB per node, i.e., 4GB for Cluster-A and 16GB for Cluster-B. Additionally, the Synthetic dataset now has six attributes and a total size of 5GB per node, i.e., 5GB for Cluster-A and 2GB for Cluster-B. We generate the values for the first attribute in the range [1..10] and with an exponential repetition for each value, i.e., $10^{i-1}$ where $i \in [1..10]$. We generate the other five attributes at random.
Then, we shuffle all tuples across the entire dataset to have the same distribution across data blocks.

MapReduce Jobs. For the UserVisits dataset, we consider eleven jobs (JobUV1 to JobUV11) with a selection predicate on attribute searchWord and with a full projection (i.e., projecting all 9 attributes). The first four jobs (JobUV1 to JobUV4) have a selectivity of .4% (1.24 million output records) and the remaining seven jobs (JobUV5 to JobUV11) have a selectivity of .2% (.62 million output records). For the Synthetic dataset, we consider another eleven jobs (JobSyn1 to JobSyn11) with a full projection, but with a selection predicate on the first attribute. These jobs have a selectivity of .2% (2.2 million output records). All jobs for both datasets select disjoint ranges to avoid caching effects.

9.6.1 Performance for the First Job

Since HAIL piggybacks adaptive indexing on MapReduce jobs, the very first question that the reader might ask is: what is the additional runtime incurred by HAIL on MapReduce jobs? We answer this question in this section. For this, we run job JobUV1 for UserVisits and job JobSyn1 for Synthetic. For these experiments, we assume that there is no block with a relevant index for jobs JobUV1 and JobSyn1. Figure 12 shows the job runtime for five variants of HAIL for the UserVisits dataset. In Cluster-A, we observe that HAIL has almost no overhead (only 1%) over HAIL (ρ = 0) when using an offer rate of 10% (i.e., ρ = 0.1). Notice that HAIL (ρ = 0) has no matching index available and hence behaves like normal Hadoop with just the binary PAX layout to speed up the job execution. We can also see that the new layout gives us an improvement of at most a factor of two in our experiments. Interestingly, we observe that HAIL is still faster than Hadoop with ρ = 0.1 and ρ = 0.25. Indeed, the overhead incurred by HAIL increases along with the offer rate used by HAIL. However, we observe that HAIL increases the execution time of JobUV1 by less than a factor of two w.r.t. both Hadoop and HAIL without any indexing, even though all data blocks are indexed in a single MapReduce job.
We especially observe that the overhead incurred by HAIL scales linearly with the ratio of indexed data blocks (i.e., with ρ), except when scaling from ρ = 0.1 to ρ = 0.25. This is because HAIL starts to be CPU bound only when offering more than 20% of the data blocks (i.e., from ρ = 0.25). This changes when running JobUV1 in Cluster-B. In these results, we clearly observe that the overhead incurred by HAIL scales linearly with ρ. We especially observe that HAIL benefits from using newer CPUs and has better performance than Hadoop for most offer rates. HAIL has only 4% overhead over Hadoop when having ρ = 1. Additionally, we can see that the adaptive indexing in HAIL incurs low overhead: from 1% (with ρ = 0.1) to 43% (with ρ = 1).
Fig. 12 HAIL performance when running the first MapReduce job (JobUV1) over UserVisits. Job runtime [s] on Cluster-A and Cluster-B for Hadoop, HAIL (ρ = 0), HAIL (ρ = 0.1), HAIL (ρ = 0.25), HAIL (ρ = 0.5), and HAIL (ρ = 1).

Fig. 13 HAIL performance when running the first MapReduce job (JobSyn1) over Synthetic. Job runtime [s] on Cluster-A and Cluster-B for Hadoop, HAIL (ρ = 0), HAIL (ρ = 0.1), HAIL (ρ = 0.25), HAIL (ρ = 0.5), and HAIL (ρ = 1).

Figure 13 shows the job runtimes for Synthetic. Overall, we observe that the overhead incurred by HAIL continues to scale linearly with the offer rate. In particular, we observe that HAIL has no overhead over Hadoop in both clusters, except for HAIL (ρ = 1) in Cluster-A (where HAIL incurs a negligible overhead of 3%). It is worth noting that when using newer CPUs (Cluster-B) adaptive indexing in HAIL has very low overhead as well: from 9% to only 23%. From these results, we can conclude that HAIL can efficiently create indexes at job runtime while limiting the overhead of writing pseudo data blocks. They also show that the lazy adaptive indexing mechanism of HAIL adapts efficiently to users' requirements via different offer rates.

9.6.2 Performance for a Sequence of Jobs

We saw in the previous section that HAIL adaptive indexing techniques can scale linearly with the help of the offer rate. But what are the implications for a sequence of MapReduce jobs? To answer this question, we run the sequence of eleven MapReduce jobs for each dataset. Figures 14 and 15 show the job runtimes for the UserVisits and Synthetic datasets, respectively. Overall, we clearly see in both computing clusters that HAIL improves the performance of MapReduce jobs linearly with the number of indexed data blocks. In particular, we observe that the higher the offer rate, the faster HAIL converges to a complete index. However, the higher the offer rate, the higher the adaptive indexing overhead for the initial job (JobUV1 and JobSyn1).
Thus, users are faced with a natural tradeoff between indexing overhead and the required number of jobs to index all blocks. However, it is worth noting that users can use low offer rates (e.g. ρ = 0.1) and still quickly converge to a complete index (e.g. after 10 job executions for ρ = 0.1). In particular, we observe that after executing only a few jobs HAIL already outperforms Hadoop significantly. For example, let us consider the sequence of jobs on Synthetic using ρ = 0.25 on Cluster-B. Remember that for this offer rate the overhead for the first job compared to HAIL without any indexing is relatively small (11%) while HAIL is still able to outperform Hadoop. With the second job HAIL is slightly faster than the full scan, and the fourth job improves over a full scan in HAIL by more than a factor of two and over Hadoop by more than a factor of five (footnote 13). As soon as HAIL converges to a complete index, HAIL significantly outperforms full scan job execution in HAIL by up to a factor of 23 and Hadoop by up to a factor of 52. For the UserVisits dataset, HAIL outperforms unindexed HAIL by up to a factor of 24 and Hadoop by up to a factor of 32. Notice that performing a full scan over Synthetic in HAIL is faster than in Hadoop, because HAIL reduces the size of this dataset when converting it to binary representation.

Fig. 14 HAIL performance when running a sequence of MapReduce jobs (JobsUV) over UserVisits. Job runtime [s] on Cluster-A and Cluster-B.

Fig. 15 HAIL performance when running a sequence of MapReduce jobs (JobsSyn) over Synthetic. Job runtime [s] on Cluster-A and Cluster-B for Hadoop, HAIL (ρ = 0), HAIL (ρ = 0.1), HAIL (ρ = 0.25), HAIL (ρ = 0.5), and HAIL (ρ = 1).

In summary, the results show that HAIL can efficiently adapt to query workloads with a very low overhead only for the very first job: the following jobs always benefit from the indexes created in previous jobs. An important result is that HAIL can converge to a complete index after running only a few jobs.

Footnote 13: Although HAIL is still indexing further blocks.
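The tradeoff between offer rate and convergence can be illustrated with a small bookkeeping sketch. It assumes a simplified model that is not spelled out in this form in the text: every job in the sequence offers a fraction ρ of all data blocks for indexing, and every offered block is indexed by that job. The function names are hypothetical.

```python
from fractions import Fraction


def indexed_fraction_per_job(offer_rate, num_jobs):
    """Fraction of data blocks indexed after each job, for a fixed offer rate."""
    rho = Fraction(str(offer_rate))  # exact arithmetic, avoids float drift
    if not 0 < rho <= 1:
        raise ValueError("offer_rate must be in (0, 1]")
    indexed = Fraction(0)
    history = []
    for _ in range(num_jobs):
        indexed = min(Fraction(1), indexed + rho)
        history.append(indexed)
    return history


def jobs_to_complete_index(offer_rate):
    """Number of jobs until all blocks are indexed, i.e. ceil(1 / offer_rate)."""
    rho = Fraction(str(offer_rate))
    if not 0 < rho <= 1:
        raise ValueError("offer_rate must be in (0, 1]")
    # Integer ceiling of denominator / numerator.
    return -(-rho.denominator // rho.numerator)


for rho in (0.1, 0.25, 0.5, 1):
    print(rho, jobs_to_complete_index(rho))
```

Under this model an offer rate of 0.1 needs 10 jobs to reach a complete index, which matches the example given above, while an offer rate of 0.25 needs 4 jobs at the price of a higher overhead on the first job.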
Fig. 16 Eager adaptive indexing vs. ρ = 0.1 and ρ = 1. Job runtime [s] over the UserVisits job sequence (JobsUV) for HAIL (eager), HAIL (ρ = 0.1), HAIL (ρ = 1), and HAIL (ρ = 0).

9.6.3 Eager Adaptive Indexing for a Sequence of Jobs

We saw in the previous section that HAIL improves the performance of MapReduce jobs linearly with the number of indexed data blocks. Now, the question that might arise in the reader's mind is: can HAIL efficiently exploit the saved runtimes for further adaptive indexing? To answer this question, we enable the eager adaptive indexing strategy in HAIL and run all UserVisits jobs again using an initial offer rate of 10%. In these experiments, we use Cluster-A and consider HAIL (without eager adaptive indexing enabled) with offer rates of 10% and 100% as baselines. Figure 16 shows the result of this experiment. As expected, we observe that HAIL (eager) has the same performance as HAIL (ρ = 0.1) for JobUV1. However, in contrast to HAIL (ρ = 0.1), HAIL (eager) keeps its performance constant for JobUV2. This is because HAIL (eager) automatically increases ρ from 0.1 to 0.17 in order to exploit saved runtimes. For JobUV3, HAIL (eager) still keeps its performance constant by increasing ρ from 0.17 to 0.33. For JobUV4, even though HAIL (eager) increases ρ from 0.33 to 1, it improves the job runtime, since only 40% of the data blocks remain unindexed. As a result of adapting its offer rate, HAIL (eager) converges to a complete index after only 4 jobs while incurring almost no overhead over HAIL. From JobUV5 on, HAIL (eager) ensures the same performance as HAIL (ρ = 1) since all data blocks are already indexed, while HAIL (ρ = 0.1) takes 6 more jobs to converge to a complete index, i.e., to index all data blocks. These results show that HAIL can converge even faster to a complete index, while still keeping a negligible indexing overhead for MapReduce jobs. Overall, these results demonstrate that HAIL (eager) adapts its offer rate efficiently according to the number of already indexed data blocks.

10 Conclusion

We presented HAIL (Hadoop Aggressive Indexing Library), a twofold approach towards zero-overhead indexing in Hadoop MapReduce.
HAIL introduced two indexing pipelines that address two major problems of traditional indexing techniques. First, HAIL static indexing solves the problem of long indexing times which had to be invested in previous indexing approaches in Hadoop. This was a severe drawback of Hadoop++ [15], which required expensive MapReduce jobs in the first place to create indexes. Second, HAIL adaptive indexing allows us to automatically adapt the set of available indexes to previously unknown or changing workloads at runtime with only minimal costs.

In more detail, HAIL static indexing allows users to efficiently build clustered indexes while uploading data to HDFS. Thereby, our novel concept of logical replication enables the system to create different sort orders (and hence clustered indexes) for each physical replica of a dataset without additional storage overhead. This means that in a standard system setup, HAIL can create three different indexes (almost) for free as a byproduct of uploading the data to HDFS. We have shown that HAIL static indexing also works well for a larger number of replicas. For example, in our experiments HAIL created six different clustered indexes in the same time HDFS took to just upload three byte-identical copies without any index.

With HAIL static indexing, we can already provide several matching indexes for a variety of queries. Still, our static indexing approach has similar limitations as other traditional techniques when it comes to unknown or changing workloads. The problem is that users have to decide upfront which attributes to index, and it is usually costly to revisit this choice in case of missing indexes. We solve this problem with HAIL adaptive indexing. Using this approach, our system can create missing but valuable indexes automatically and incrementally at job execution time. In contrast to previous work, our adaptive indexing technique again focuses on indexing at minimal expense.

We have experimentally compared HAIL with Hadoop as well as Hadoop++ using different datasets and a number of different clusters. The results demonstrated the superiority of HAIL. For HAIL static indexing, our experiments showed that we typically create a win-win situation: e.g.
users can upload their datasets up to 1.6x faster than Hadoop (despite the additional indexing effort!) and run jobs up to 68x faster than Hadoop. Our second set of experiments demonstrated the high efficiency of HAIL adaptive indexing in creating clustered indexes at job runtime and adapting to users' workloads. In terms of indexing effort, HAIL adaptive indexing has a very low overhead compared to a HAIL full scan (which is already 2x faster than a Hadoop full scan). For example, we observed a 1% runtime overhead for the UserVisits dataset when using an offer rate of 10%, and only for the very first job. The following jobs already run faster than the full scan in HAIL, e.g. 2 times faster from the fourth job on, with an offer rate of 25%. The results also show that, even for low offer rates, our approach quickly converges to a complete index after running only a small number of MapReduce jobs (e.g. after 10 jobs with an offer rate of 10%). In terms of job runtimes, HAIL
adaptive indexing improves performance dramatically. For a sequence of previously unseen jobs on unindexed attributes, runtime improved by up to a factor of 24 over HAIL without adaptive indexing and a factor of 52 over Hadoop.

Acknowledgments. Research supported by the Cluster of Excellence on Multimodal Computing and Interaction and the Bundesministerium für Bildung und Forschung.

References

1. A. Abouzied, D. J. Abadi, and A. Silberschatz. Invisible Loading: Access-Driven Data Transfer from Raw Files into Database Systems. In EDBT, pages 1-10.
2. S. Agrawal et al. Database Tuning Advisor for Microsoft SQL Server 2005. VLDB.
3. A. Ailamaki et al. Weaving Relations for Cache Performance. VLDB.
4. I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki. NoDB: Efficient Query Execution on Raw Data Files. In SIGMOD Conference.
5. S. Blanas et al. A Comparison of Join Algorithms for Log Processing in MapReduce. SIGMOD.
6. N. Bruno and S. Chaudhuri. To Tune or not to Tune? A Lightweight Physical Design Alerter. In VLDB.
7. N. Bruno and S. Chaudhuri. An Online Approach to Physical Design Tuning. In ICDE.
8. N. Bruno and S. Chaudhuri. Physical Design Refinement: The Merge-Reduce Approach. ACM TODS, 32(4).
9. M. J. Cafarella and C. Ré. Manimal: Relational Optimization for Data-Intensive Programs. WebDB.
10. S. Chaudhuri and V. R. Narasayya. An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server. In VLDB.
11. S. Chaudhuri and V. R. Narasayya. Self-Tuning Database Systems: A Decade of Progress. In VLDB, pages 3-14.
12. S. Chen. Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce. PVLDB, 3(1-2).
13. J. Dean and S. Ghemawat. MapReduce: A Flexible Data Processing Tool. CACM, 53(1):72-77.
14. J. Dittrich and J.-A. Quiané-Ruiz. Efficient Parallel Data Processing in MapReduce Workflows. PVLDB, 5.
15. J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). PVLDB, 3(1).
16. J. Dittrich, J.-A. Quiané-Ruiz, S. Richter, S. Schuh, A. Jindal, and J. Schad.
Only Aggressive Elephants are Fast Elephants. PVLDB, 5(11), 2012.
17. J.-P. Dittrich, P. M. Fischer, and D. Kossmann. AGILE: Adaptive Indexing for Context-Aware Information Filters. In SIGMOD.
18. M. Y. Eltabakh et al. CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop. PVLDB, 4(9).
19. S. J. Finkelstein et al. Physical Database Design for Relational Databases. ACM TODS, 13(1):91-128.
20. G. Graefe, F. Halim, S. Idreos, H. A. Kuno, and S. Manegold. Concurrency Control for Adaptive Indexing. PVLDB, 5(7).
21. G. Graefe and H. A. Kuno. Self-selecting, self-tuning, incrementally optimized indexes. In EDBT.
22. Hadoop Users.
23. F. Halim et al. Stochastic Database Cracking: Towards Robust Adaptive Indexing in Main-Memory Column-Stores. PVLDB, 5(6):502-513.
24. F. Halim, S. Idreos, P. Karras, and R. H. C. Yap. Stochastic Database Cracking: Towards Robust Adaptive Indexing in Main-Memory Column-Stores. PVLDB, 5(6):502-513.
25. H. Herodotou and S. Babu. Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs. PVLDB, 4(11).
26. S. Idreos, I. Alagiannis, R. Johnson, and A. Ailamaki. Here are my Data Files. Here are my Queries. Where are my Results? In CIDR, pages 57-68.
27. S. Idreos et al. Database Cracking. In CIDR, pages 68-78.
28. S. Idreos et al. Self-organizing tuple reconstruction in column-stores. In SIGMOD.
29. S. Idreos et al. Merging What's Cracked, Cracking What's Merged: Adaptive Indexing in Main-Memory Column-Stores. PVLDB, 4(9).
30. S. Idreos, M. L. Kersten, and S. Manegold. Updating a Cracked Database. In SIGMOD Conference.
31. E. Jahani et al. Automatic Optimization for MapReduce Programs. PVLDB, 4(6).
32. D. Jiang et al. The Performance of MapReduce: An In-depth Study. PVLDB, 3(1).
33. A. Jindal, J.-A. Quiané-Ruiz, and J. Dittrich. Trojan Data Layouts: Right Shoes for a Running Elephant. SOCC.
34. J. Lin et al. Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics. MapReduce Workshop.
35. D. Logothetis et al. In-Situ MapReduce for Log Processing. USENIX.
36. M. Lühring et al. Autonomous Management of Soft Indexes. In ICDE Workshop on Self-Managing Database Systems.
37. C. Olston.
Keynote: Programming and Debugging Large-Scale Data Processing Workflows. SOCC.
38. A. Pavlo et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.
39. J.-A. Quiané-Ruiz, C. Pinkel, J. Schad, and J. Dittrich. RAFTing MapReduce: Fast recovery on the RAFT. ICDE, pages 589-600.
40. J. Schad, J. Dittrich, and J.-A. Quiané-Ruiz. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. PVLDB, 3(1):460-471.
41. K. Schnaitter et al. COLT: Continuous On-line Tuning. In SIGMOD.
42. A. Thusoo et al. Data Warehousing and Analytics Infrastructure at Facebook. SIGMOD.
43. T. White. Hadoop: The Definitive Guide. O'Reilly.
44. H.-C. Yang and D. S. Parker. Traverse: Simplified Indexing on Large Map-Reduce-Merge Clusters. In DASFAA.
45. M. Zaharia et al. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. EuroSys, 2010.
Fundamentals of Cellular Networks
Fundmentls of ellulr Networks Dvid Tipper Assoite Professor Grdute Progrm in Teleommunitions nd Networking University of Pittsburgh Slides 4 Telom 2720 ellulr onept Proposed by ell Lbs 97 Geogrphi Servie
Would your business survive a crisis? A guide to business continuity planning. www.staffordbc.gov.uk
Would your usiness survive risis? A guide to usiness ontinuity plnning www.stfford.gov.uk 2 A guide to Business Continuity Plnning A guide to usiness ontinuity plnning Contents The Lw Wht type of inidents
Lesson 1: Getting started
Answer key 0 Lesson 1: Getting strte 1 List the three min wys you enter t in QuikBooks. Forms, lists, registers 2 List three wys to ess fetures in QuikBooks. Menu r, Ion Br, Centers, Home pge 3 Wht ookkeeping
Regular Sets and Expressions
Regulr Sets nd Expressions Finite utomt re importnt in science, mthemtics, nd engineering. Engineers like them ecuse they re super models for circuits (And, since the dvent of VLSI systems sometimes finite
Small Business Cloud Services
Smll Business Cloud Services Summry. We re thick in the midst of historic se-chnge in computing. Like the emergence of personl computers, grphicl user interfces, nd mobile devices, the cloud is lredy profoundly
Orthodontic marketing through social media networks: The patient and practitioner s perspective
Originl rtile Orthodonti mrketing through soil medi networks: The ptient nd prtitioner s perspetive Kristin L. Nelson ; Bhvn Shroff ; l M. Best ; Steven J. Linduer d BSTRCT Ojetive: To (1) ssess orthodonti
Revised products from the Medicare Learning Network (MLN) ICD-10-CM/PCS Myths and Facts, Fact Sheet, ICN 902143, downloadable.
DEPARTMENT OF HEALTH AND HUMAN SERVICES Centers for Meire & Meii Servies Revise prouts from the Meire Lerning Network (MLN) ICD-10-CM/PCS Myths n Fts, Ft Sheet, ICN 902143, ownlole. MLN Mtters Numer: SE1325
MATH PLACEMENT REVIEW GUIDE
MATH PLACEMENT REVIEW GUIDE This guie is intene s fous for your review efore tking the plement test. The questions presente here my not e on the plement test. Although si skills lultor is provie for your
THE LONGITUDINAL FIELD IN THE GTEM 1750 AND THE NATURE OF THE TERMINATION.
THE LONGITUDINAL FIELD IN THE GTEM 175 AND THE NATURE OF THE TERMINATION. Benjmin Guy Loder Ntionl Physil Lbortory, Queens Rod, Teddington, Middlesex, Englnd. TW11 LW Mrtin Alexnder Ntionl Physil Lbortory,
