Analyzing Failures of a Semi-Structured Supercomputer Log File Efficiently by Using PIG on Hadoop

International Journal of Computer Science and Engineering, Open Access Research Paper, Volume-2, Issue-1, E-ISSN: 2347-2693

Madhuri Srinivas Pall 1*, Konisa Jyothsna 2 & B. Anusha 3
1* Computer Science Engineering, National Institute of Technology, Warangal, India, madhuri.pall@yahoo.com
2 Computer Science Engineering, National Institute of Technology, Warangal, India, jyothsna1503@gmail.com
3 Computer Science Engineering, National Institute of Technology, Warangal, India, anushabattula321@gmail.com
www.ijcaonline.org
Received: 12/12/2013 Revised: 20/12/2013 Accepted: 28/12/2013 Published: 31/Jan/2014

Abstract
Data sets used to fuel the recently popular concept of business intelligence are becoming increasingly large. Conventional database management software is no longer efficient enough; parallel database management systems and massive data-scale processing systems like MapReduce look promising instead. Although MapReduce is a good option, it is difficult to work with, as the programmer has to think at the mapper and reducer level. In this paper, we present a simple yet efficient way to mine useful information, in which a program can be written as a series of steps. We have queried a supercomputer log file using Apache's Hadoop and PIG, obtained results as to when and why the supercomputer had failed, and compared these results to those of a traditional program.

Keywords- Big Data, Parallel Processing, Hadoop, MapReduce, Data Mining, Business Intelligence, PIG, Log file analysis, Supercomputer

I. INTRODUCTION
Nowadays, business intelligence has become a popular trend. What used to be trial and error has now developed into decisions based on past experiences and other data. The efficacy of mining useful information is ever changing and hence more and more methods to make mining efficient are being proposed. In this paper, we propose a method to analyze the log file of a supercomputer from the bio-informatics background. Log files of this supercomputer are not only vast, but also semi-structured.
The proposed method not only proves to be more efficient than the existing ones, but also happens to be much simpler. One of the most efficient solutions to vast log file analysis is using a parallel processing system. However, without a querying language, the programmer needs in-depth knowledge of MapReduce. In many situations, the programmer either does not have such experience or would like a solution that does not require much training. Also, results are usually required within a short interval, as most real-time data has high velocity, which requires programmers to provide instant analysis. We present a solution that allows people to mine information efficiently on a parallel processing system without requiring much training or background knowledge. As a PIG program is nothing but a series of steps, or in other words stepwise instructions, it is easy to use, keeping in mind the general population. We have implemented this querying language in our solution for analyzing the failures of a supercomputer.

Corresponding Author: Madhuri Srinivas Pall

The objective of the paper is to analyze the failures of a supercomputer by mining its log file in two ways and to compare the results of both methods. The first way is a traditional program written in Java that reads the file and outputs the required, relevant information. The second method is to store the files in the HDFS (Hadoop Distributed File System) and write a PIG query on MapReduce that gives the semi-structured log some schema to work with and writes the required results back to the HDFS. The exact implementation is explained in the respective section of the paper. By mining both ways, the time taken by both methods for different sizes of data was compared to see which method is more efficient and feasible. As hypothesized, the method involving MapReduce and the PIG querying language has proven to be much faster. Statistics are presented in the results section of the paper.

In the next section, we give the background relevant to our research. It contains a brief description of MapReduce and PIG.
Sections III and IV contain the need for the study with the problem statement and the implementation, respectively. Results are presented in Section V, and Section VI contains the conclusion and scope for further research. The paper ends with a list of references used for our research.

2014, IJCSE All Rights Reserved

II. BACKGROUND
Big Data, as per the Gartner definition, refers to data that has one or more of the following three attributes: volume, velocity and variety. The analysis of such data is known to be useful in business intelligence and decision-making. Over the years, data mining has been a hot trend all around the world. Data analysts mine large amounts of data to spot recurring trends that may provide a competitive edge. As mentioned above, data can be of different formats. In order to analyze semi-structured and unstructured data efficiently, we can use Hadoop.

Hadoop is an open source software framework. Its architecture can be broadly classified into two parts: the Hadoop Distributed File System (HDFS), which is its storage mechanism, and the Hadoop distributed computational mechanism, which is popularly known as MapReduce. On a fully configured Hadoop cluster, there are five running daemons: the namenode, the datanode, the jobtracker, the tasktracker and the secondary namenode. The namenode and datanode have a master-slave relationship. The namenode is the master; it contains the files' metadata and keeps track of which blocks of data go to which node. The datanodes do the grunt work of reading and writing HDFS data blocks to the file system. The secondary namenode takes snapshots of the namenode at regular intervals for failure recovery purposes. The jobtracker and tasktracker have a relationship similar to that of the namenode and datanode. While the jobtracker monitors all the tasks, the tasktracker is responsible for individual tasks. The overall topology is depicted in the following diagram.

The MapReduce model's data flow contains two main steps: map and reduce. In brief, the Map step consists of the master node dividing the input data into sub-problems and distributing these sub-problems to the working nodes. The working nodes can further sub-divide their own problem and distribute it amongst their own working nodes in the form of a multi-level tree. At any iterative level, the working node computes the required output and reports back to its master. The Reduce step is somewhat like a summary step. It collects all the sub-problem outputs, combines them and produces the originally required answer. Together, they form the framework of a distributed system.

PIG is an extension of Hadoop which is used to simplify the unnecessary complexity of MapReduce.
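To make concrete the mapper-and-reducer-level thinking that PIG is meant to hide, the two steps can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop's API: the sample log lines, the FAIL marker, and the function names map_line, shuffle, and reduce_counts are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical log lines (month, day, time, message); not taken from the paper's data.
LINES = [
    "Jan 30 10:01:07 kernel: node 12 FAIL memory error",
    "Jan 30 10:05:44 kernel: node 12 ok",
    "Jan 31 02:13:10 kernel: node 07 FAIL link down",
    "Jan 30 23:59:59 kernel: node 03 FAIL disk timeout",
]

def map_line(line):
    """Map step: emit a (key, 1) pair for each failure record."""
    month, day, _rest = line.split(" ", 2)
    if "FAIL" in line:
        yield (f"{month} {day}", 1)

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between the two steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(groups):
    """Reduce step: combine each key's values into one summary value."""
    return {key: sum(values) for key, values in groups.items()}

pairs = [pair for line in LINES for pair in map_line(line)]
failures_per_day = reduce_counts(shuffle(pairs))
print(failures_per_day)
```

Even for a simple count of failures per day, the programmer has to decide what the key is, what each mapper emits, and how the reducer combines values. A query language expresses the same result as a filter followed by a grouping.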
PIG contains two main components: a high-level language, PIG Latin, and a compiler that turns PIG Latin programs into jobs for an execution environment, which is usually Hadoop. Complex tasks can be written explicitly as data flow sequences, which makes a program easier to write and maintain. The user can concentrate on the semantics of the program rather than on optimization, as this is done automatically by PIG. Pig queries can be written in three ways. The first way is by using the interactive grunt shell. The second way is to write a script file, which is usual for large, repetitive programs, and the third way is to embed the queries in a Java program. PIG also runs in two different modes, the local mode and the Hadoop (mapreduce) mode. In this paper, we are using PIG in the mapreduce mode.

Ways of Running Pig Latin Script

Method                             Description
Grunt interactive shell            Generally used for ad hoc data; queries are entered manually, line by line
Script file                        Used for large, repetitive Pig programs; format: pig myscript.pig
Embedded queries in Java program   The PigServer class allows any Java program to execute queries

A sample query in PIG is given below (Fig. 2.1). The LOAD command loads the file at the given path as per the defined schema, using a space as the delimiter. No data types are declared in the schema, so the fields take PIG's default type (bytearray, rather than chararray) and are converted implicitly where the query needs a string. The alias grp contains the tuples filtered according to the specified condition. In this query, pattern matching is used to check the date of the tuple. The LIMIT operator limits the total number of tuples to 10 for the sake of convenience, and the alias cntd represents these 10 tuples. To see what is happening behind the scenes, PIG provides three diagnostic operators: ILLUSTRATE, EXPLAIN and DESCRIBE.

    grunt> log = LOAD '/usr/mady/project.log' USING PigStorage(' ') AS (month,day,time,info1,info2,info3);
    grunt> grp = FILTER log BY (day matches '.*30.*');
    grunt> cntd = LIMIT grp 10;
    grunt> ILLUSTRATE cntd;

Fig. 2.1
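The data flow of the query in Fig. 2.1 can also be read as three ordinary steps. The sketch below mirrors LOAD, FILTER, and LIMIT in plain Python on a few made-up lines; it illustrates the semantics only and is not how PIG executes the query. The sample data and the helper names load, filter_by_day, and limit are assumptions for this example.

```python
import re
from itertools import islice

# Made-up sample lines; fields are separated by single spaces as in Fig. 2.1.
SAMPLE = """Jan 30 10:01:07 kernel node12 FAIL
Jan 29 11:15:00 kernel node03 ok
Jan 30 12:30:45 kernel node07 ok
Feb 01 00:00:01 kernel node12 FAIL"""

FIELDS = ("month", "day", "time", "info1", "info2", "info3")

def load(text, delimiter=" "):
    """LOAD ... USING PigStorage(' ') AS (...): split each line and name the fields."""
    for line in text.splitlines():
        yield dict(zip(FIELDS, line.split(delimiter)))

def filter_by_day(records, pattern):
    """FILTER ... BY (day matches pattern): keep records whose whole day field matches."""
    regex = re.compile(pattern)
    return (r for r in records if regex.fullmatch(r["day"]))

def limit(records, n):
    """LIMIT: pass through at most n records."""
    return list(islice(records, n))

log = load(SAMPLE)
grp = filter_by_day(log, r".*30.*")
cntd = limit(grp, 10)
for record in cntd:
    print(record["month"], record["day"], record["time"], record["info3"])
```

The generators are lazy, so nothing is read until limit pulls records through the chain. PIG likewise builds a plan from the aliases and only runs it when an output or diagnostic operator is issued.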

In the query of Fig. 2.1, each separate command can be distinguished by the grunt> shell prompt. The screenshot below (Fig. 2.2) shows how a few of the columns of the schema change when the ILLUSTRATE command is used.

Fig. 2.2

One of the advantages of PIG over SQL is that it has a looser schema approach, which makes it more suitable for semi-structured and unstructured data.

III. IMPORTANCE OF THE STUDY AND PROBLEM STATEMENT
Analysis of big data has become of utmost priority today. Specifically, analysis of log files is required in order to gauge and understand the failures of the supercomputer. By doing so, future failures and losses can be prevented. In this paper, we propose a solution for analyzing the log files of a supercomputer that is efficient compared to a traditional program. We have also taken into consideration the fact that people attempting to analyze these files may not have much experience and knowledge in the MapReduce domain. The solution proposed in this paper is important as it can be used by a large population, not just technical experts. We have targeted a more diverse population as potential users of this solution. It is also apt not just for structured logs, but for semi-structured logs as well. Because PIG computes in a distributed environment, results can be computed however large the input is.

IV. IMPLEMENTATION
We have used PIG on Hadoop version 1.0.4. Firstly, we took a sample semi-structured supercomputer log file, a part of which is shown below (Fig. 4.1). In this file, there is information like the date, the time, the kernel in question and a message. We have taken each piece of information as a column by defining a schema.

Fig. 4.1

The log file was copied to the HDFS. While writing the query, we searched for all logs in which the supercomputer had failed. The query was executed in the MapReduce mode of PIG and was tested on the interactive grunt shell as well as by saving it as a script file. After the MapReduce functions were performed, these logs were stored in an output file in the HDFS for viewing.
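Giving a semi-structured line some schema to work with amounts to naming the leading fields and leaving the free-form remainder as one column. A minimal Python sketch of that idea follows. The line layout (date, time, kernel, message), the sample lines, and the rule that a failure is recognized by the word "fail" in the message are all hypothetical, since the paper does not state the exact failure condition.

```python
from typing import NamedTuple, Optional

class LogRecord(NamedTuple):
    date: str
    time: str
    kernel: str
    message: str

def parse(line: str) -> Optional[LogRecord]:
    """Apply the schema: three fixed fields, then the rest as a free-form message."""
    parts = line.split(None, 3)  # split on whitespace, at most 3 times
    if len(parts) < 4:
        return None  # too short to fit the schema; skip instead of failing
    return LogRecord(*parts)

def failed(record: LogRecord) -> bool:
    """Hypothetical failure test: the message mentions a failure."""
    return "fail" in record.message.lower()

# Made-up lines for illustration only.
sample = [
    "2005-06-03 15:42:50 R02-M1-N0 instruction cache parity error corrected",
    "2005-06-04 00:24:32 R23-M0-NE machine check: kernel FAILURE on node card",
    "garbled-line",
    "2005-06-05 09:10:11 R17-M1-N4 rts: kernel terminated, job failed",
]

records = [r for r in map(parse, sample) if r is not None]
failures = [r for r in records if failed(r)]
for r in failures:
    print(r.date, r.time, r.kernel, r.message)
```

Lines that do not fit the schema are skipped rather than rejected, which is the kind of tolerance a loose schema gives when the log is only partly regular.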
We also ran user-defined queries, such as which activities were going on at a particular user-defined time or on a user-defined date. A screenshot of the statistics for a sample file is shown below (Fig. 4.2).

Fig. 4.2

As seen in the statistics screenshot, various measures are displayed, including the job_id, maps, reduces, maximum map time, minimum map time, average map time, median map time, maximum reduce time, minimum reduce time, average reduce time, median reduce time, aliases and feature outputs. Various counters are also displayed, such as the total number of records written.
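The per-job time measures listed above are simple summaries over the run times of the individual map or reduce tasks. As a small illustration, the following Python sketch computes them from a list of task durations. The durations are invented for the example and do not come from Fig. 4.2, and the function name summarize is an assumption.

```python
from statistics import mean, median

def summarize(durations_ms):
    """Return the maximum, minimum, average, and median of task run times."""
    return {
        "max": max(durations_ms),
        "min": min(durations_ms),
        "avg": mean(durations_ms),
        "median": median(durations_ms),
    }

# Invented run times (milliseconds) for four map tasks of one job.
map_times_ms = [2100, 1900, 2500, 2300]
summary = summarize(map_times_ms)
print(summary)
```

A large gap between the maximum and the median would point to a straggling task rather than to a uniformly slow job.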

To view the output records of the query, we access the HDFS. Below (Fig. 4.3) is the output for the sample log file.

Fig. 4.3

In this paper, we have determined the efficiency of using PIG compared to that of a traditional Java program by taking log files of different sizes and analyzing them both ways. Each time, we doubled the size of the log file in order to get an exact picture as to which method is suitable for which kind of file (i.e. small, medium, large). The results of the performance check are given in the next section of the paper.

V. RESULT & ANALYSIS
On taking files of different sizes, we have observed that by using Hadoop and PIG, log files can be analyzed much more efficiently than with a traditional Java program. Not only does the time taken to analyze the file reduce considerably with this method, but programming with it has also proven to be much simpler and more user-friendly. As Hadoop's architecture uses parallel processing, it is much more feasible for larger files. That is, to process larger files, more maps can be used in order to maintain efficiency.

The graph (Fig. 5.1) was made by recording the time, in milliseconds, taken to process the logs of files of different sizes. This was done for both methods in our comparison.

Fig. 5.1 Performance Comparison. Vertical axis: execution time (milliseconds), 0 to 90 000. Horizontal axis: ticks from 0 to 1500.

Below (Table 5.2) are the values taken to draw the graph. In the Hadoop method, a few initial values are unstable. This is probably due to differences in the number of maps allotted, the Hadoop environment and the number of resources available. The traditional method shows no such inconsistency.

Size (megabyte)   Traditional method (millisec)   Hadoop (millisec)
0.093             49                              12
0.18              70                              63
0.37              79                              49
0.74              106                             50
1.5               147                             58
2.9               328                             102
5.8               506                             57
11.6              901                             60
23.3              2093                            63
46.6              3458                            77
93                8126                            80
188               18920                           133
376               22326                           324
752               36206                           437
1504              76638                           807

Table 5.2

VI. CONCLUSION
According to the experimental results, the method using the querying language PIG on Hadoop has proven to be scalable, reliable, faster, and efficient. One of its major advantages besides efficiency is that it is simple and easy to use, hence targeting a wider audience. The method reliably analyzes the log files, and hence proves useful for structured as well as semi-structured data. Extensions of this paper could include a solution for analyzing completely unstructured data.

REFERENCES
[1]. T. White, Hadoop: The Definitive Guide. Yahoo Press, 2010.
[2]. Chuck Lam, Pig: Hadoop in Action.
[3]. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Comm. of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[4]. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing," Proc. of the 2008 ACM SIGMOD International Conference on Management of Data, 2008, pp. 1099-1110.
[5]. Thomas Reidemeister, Mohammad Ahmad Munawar, Miao Jiang, Paul A.S. Ward, "Diagnosis of Recurrent Faults using Log Files," Proc. of the 2009 Conference of the Center for Advanced Studies on Collaborative Research, November 2009, pp. 12-23.
[6]. Apache. Hadoop: Open-source implementation of MapReduce. http://hadoop.apache.org.
[7]. Apache. Pig: High-level dataflow system for Hadoop. http://www.pig.apache.org
[8]. Michael Cardosa, Chenyu Wang, Anshuman Nangia, Abhishek Chandra, Jon Weissman, "Exploring MapReduce efficiency with highly-distributed data," Proc. of the second international workshop on MapReduce and its applications, June 2011, pp. 27-34.
[9]. H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker, "Map-reduce-merge: simplified relational data processing on large clusters," Proc. of the SIGMOD Conference, 2007, pp. 1029-1040.
[10]. A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava, "Building a High-Level Dataflow System on Top of Map-Reduce: The Pig Experience," Proc. of the VLDB Endowment, vol. 2, no. 2, 2009.
[11]. A tutorial on pig: http://www.pig-tutorial.blogspot.in/