Analyzing Failures of a Semi-Structured Supercomputer Log File Efficiently by Using PIG on Hadoop




International Journal of Computer Science and Engineering, Open Access Research Paper, Volume-2, Issue-1, E-ISSN: 2347-2693

Analyzing Failures of a Semi-Structured Supercomputer Log File Efficiently by Using PIG on Hadoop

Madhuri Srinivas Pall 1*, Konisa Jyothsna 2 & B. Anusha 3

1* Computer Science Engineering, National Institute of Technology, Warangal, India, madhuri.pall@yahoo.com
2 Computer Science Engineering, National Institute of Technology, Warangal, India, jyothsna1503@gmail.com
3 Computer Science Engineering, National Institute of Technology, Warangal, India, anushabattula321@gmail.com

www.ijcaonline.org

Received: 12/12/2013 Revised: 20/12/2013 Accepted: 28/12/2013 Published: 31/Jan/2014

Abstract- Data sets used to fuel the recently popular concept of business intelligence are becoming increasingly large. Conventional database management software is no longer efficient enough; however, parallel database management systems and massive data-scale processing systems like MapReduce do look promising. Although MapReduce is a good option, it is difficult to work with, as the programmer has to think at the mapper and reducer level. In this paper, we present a simple yet efficient way to mine useful information, in which a program can be written as a series of steps. We have queried a supercomputer log file using Apache's Hadoop and PIG, obtained results as to when and why the supercomputer had failed, and compared these results to those of a traditional program.

Keywords- Big Data, Parallel Processing, Hadoop, MapReduce, Data Mining, Business Intelligence, PIG, Log file analysis, Supercomputer

I. INTRODUCTION

Nowadays, business intelligence has become a popular trend. What used to be trial and error has now developed into decisions based on past experiences and other data. The efficacy of mining useful information is ever changing, and hence more and more methods to make mining efficient are being proposed. In this paper, we propose a method to analyze the log file of a supercomputer from the bio-informatics background. Log files of this supercomputer are not only vast, but are also semi-structured.
The proposed method not only proves to be more efficient than the existing ones, but is also much simpler. One of the most efficient solutions to vast log file analysis is using a parallel processing system. However, without using a querying language, the programmer will need to have in-depth knowledge of MapReduce. In many situations, the programmer either does not have such experience or would like a solution that does not require much training. Also, the resultant data is usually required within a short interval, as most real-time data has high velocity, which requires programmers to provide near-instant analysis. We present a solution that allows people to mine information efficiently on a parallel processing system without requiring much training or background knowledge. As a PIG program is nothing but a series of steps, or in other words stepwise instructions, it is easy to use, keeping in mind the general population. We have implemented this querying language in our solution for analyzing the failures of a supercomputer.

Corresponding Author: Madhuri Srinivas Pall

The objective of the paper is to analyze the failures of a supercomputer by mining its log file in two ways and to compare the results of both methods. The first way is a traditional program written in Java that reads the file and outputs the required, relevant information. The second method is to store the files in the HDFS (Hadoop Distributed File System) and write a PIG query, executed on MapReduce, that gives the semi-structured log some schema to work with and outputs the required results back to the HDFS. The exact implementation is explained in the respective section of the paper. By mining both ways, the time taken by both methods for different sizes of data was compared to see which method is more efficient and feasible. As hypothesized, the method involving MapReduce and the PIG querying language has proven to be much faster. Statistics are presented in the results section of the paper.

In the next section, we give the background relevant to our research. This contains a brief description of MapReduce and PIG.
Sections III and IV contain the problem statement and the implementation respectively. Results are presented in Section V, and Section VI contains the conclusion and scope for further research. The paper ends with a list of references used for our research.

2014, IJCSE All Rights Reserved

II. BACKGROUND

Big Data, as per the Gartner definition, refers to data that has one or more of the following three attributes: volume, velocity and variety. The analysis of such data is known to be useful in business intelligence and decision-making. Over the years, data mining has been a hot trend all around the world. Data analysts mine large amounts of data to spot some kind of recurring trends that may just provide a competitive edge. As mentioned above, data can be of different formats. In order to analyze semi-structured and unstructured data efficiently, we can use Hadoop.

Hadoop is an open source software framework. Its architecture can be broadly classified into two parts: the Hadoop Distributed File System (HDFS), which is its storage mechanism, and the Hadoop distributed computational mechanism, which is popularly known as MapReduce. On a fully configured Hadoop cluster, there are five running daemons: the namenode, the datanode, the jobtracker, the tasktracker and the secondary namenode. The namenode and datanode have a master-slave relationship. The namenode is the master; it contains the files' metadata and keeps track of which blocks of data go to which node. The datanodes do the grunt work of reading and writing HDFS data blocks to the file system. The secondary namenode takes snapshots of the namenode at regular intervals for failure recovery purposes. The jobtracker and tasktracker have a relationship similar to that of the namenode and datanode. While the jobtracker monitors all the tasks, the tasktracker is responsible for individual tasks. The overall topology is depicted in the following diagram.

The MapReduce model's data flow contains two main steps: map and reduce. In brief, the Map step consists of the master node dividing the data input into sub-problems and distributing these problems to the working nodes. The working nodes can further sub-divide their own problem and distribute it amongst their own working nodes in the form of a multi-level tree. At any iterative level, the working node computes the required output and reports back to its master. The Reduce step is somewhat like a summary step. It collects all the sub-problem outputs, combines them and produces the originally required answer. Together, they form the framework of a distributed system.

PIG is an extension of Hadoop which is used to simplify the unnecessary complexity of MapReduce.
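To make the map and reduce steps described above concrete, here is a minimal single-process sketch in Python. It is an illustration only, not Hadoop code and not the authors' implementation: the function names (split_input, map_chunk, reduce_pairs, run_job) and the choice of counting log lines by their first token are assumptions made for this example.

```python
from collections import defaultdict


def split_input(lines, n_chunks):
    """Master step: divide the input into n_chunks sub-problems."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """Worker step: emit a (key, 1) pair per non-blank line.

    The key is the first whitespace-separated token (an assumed layout).
    """
    return [(line.split()[0], 1) for line in chunk if line.strip()]


def reduce_pairs(mapped_chunks):
    """Reduce step: collect every worker's pairs and sum the counts per key."""
    totals = defaultdict(int)
    for pairs in mapped_chunks:
        for key, count in pairs:
            totals[key] += count
    return dict(totals)


def run_job(lines, n_chunks=3):
    """Run the whole flow sequentially: split, map each chunk, then reduce."""
    chunks = split_input(lines, n_chunks)
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_pairs(mapped)
```

On a real cluster the chunks would be handled by separate worker nodes in parallel; here they run one after another, which is enough to show the kind of data flow that a PIG query is compiled into.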
PIG contains two main components: a high-level language, PIG Latin, and a compiler that translates PIG Latin into jobs for an execution platform, which is usually Hadoop. Complex tasks can be explicitly written as data flow sequences, which makes it easier to write and maintain any particular program. The user is given the opportunity to concentrate on the semantics of the program rather than optimization, as this is done automatically while using PIG. Pig queries can be written in three ways. The first way is by using the interactive grunt shell. The second way is to write a script file, which is usually for large, repetitive programs, and the third way is embedding the queries in a Java program. Also, PIG runs in two different modes, the local mode and the Hadoop (mapreduce) mode. In this paper, we are using PIG in the mapreduce mode.

Ways of Running Pig Latin Script

Method                              Description
Grunt interactive shell             Generally used for ad hoc data; queries are entered manually, line by line
Script file                         Used for large, repetitive Pig programs; format: pig myscript.pig
Embedded queries in Java program    The PigServer class allows any Java program to execute a query

A sample query in PIG is given below (Fig. 2.1). The LOAD command loads the file at the given path as per the defined schema, using a space as the delimiter. Here no data type is declared for the fields, so each takes Pig's default type, bytearray. The alias grp contains the tuples filtered according to the specified condition. In this query, pattern matching was used to check the date of the tuple. The LIMIT operator limits the total number of tuples to 10 for convenience, and the alias cntd represents these 10 tuples. To see what is happening behind the scenes, PIG provides three diagnostic operators: ILLUSTRATE, EXPLAIN and DESCRIBE.

Fig. 2.1

    grunt> log = LOAD '/user/mady/project.log' USING PigStorage(' ') AS (month,day,time,info1,info2,info3);
    grunt> grp = FILTER log BY (day matches '.*30.*');
    grunt> cntd = LIMIT grp 10;
    grunt> ILLUSTRATE cntd;
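For readers more at home in a general-purpose language, the data flow of the query in Fig. 2.1 can be approximated in plain Python. This is a hedged sketch, not what PIG executes internally: the helper names (load, filter_by_day, limit) are hypothetical, and padding short rows with None is a simplification chosen for this example.

```python
import re
from itertools import islice

# Field names taken from the AS clause of the query in Fig. 2.1.
FIELDS = ("month", "day", "time", "info1", "info2", "info3")


def load(lines, delimiter=" "):
    """Mimic LOAD ... USING PigStorage(' ') AS (...): split lines into named fields."""
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        # Pad short rows with None and drop extra fields so every record has the same keys.
        parts = (parts + [None] * len(FIELDS))[:len(FIELDS)]
        yield dict(zip(FIELDS, parts))


def filter_by_day(records, pattern=r".*30.*"):
    """Mimic FILTER log BY (day matches '.*30.*'); the whole field must match."""
    regex = re.compile(pattern)
    return (r for r in records if r["day"] is not None and regex.fullmatch(r["day"]))


def limit(records, n=10):
    """Mimic LIMIT grp 10: keep at most n records."""
    return list(islice(records, n))
```

Each function corresponds to one line of the query, so limit(filter_by_day(load(lines))) reads in the same order as the PIG Latin steps. Generators are used so that, as in the query, no more lines are processed than the limit requires.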

In Fig. 2.1, each separate command is marked by the grunt> shell prompt. The screenshot below (Fig. 2.2) shows how a few of the columns of the schema change by using the ILLUSTRATE command.

Fig. 2.2

One of the advantages of PIG over SQL is that it has a looser schema approach than that of SQL, which therefore makes it more suitable for semi-structured and unstructured data.

III. IMPORTANCE OF THE STUDY PROBLEM STATEMENT

Analysis of big data has become of utmost priority today. Specifically, analysis of log files is required in order to gauge and understand the failures of the supercomputer. By doing so, future failures and losses can be prevented. In this paper, we have proposed a solution for analyzing the log files of a supercomputer that is more efficient than a traditional program. We have also taken into consideration the fact that people attempting to analyze these files may not have much experience and knowledge in the MapReduce domain. The solution proposed in this paper is important as it can be used by a large population, not just technical experts. We have targeted a more diverse population as potential users of this proposed solution. Also, this is an apt solution for not just structured logs, but semi-structured logs as well. Using PIG, we can compute in a distributed environment, so results can be computed however large the input is.

IV. IMPLEMENTATION

We have used PIG on Hadoop version 1.0.4. Firstly, we have taken a sample semi-structured supercomputer log file, a part of which is shown below (Fig. 4.1). In this file, there is information like the date, time, kernel in question and a message. We have taken each piece of information as a column by defining a schema.

Fig. 4.1

The log file was copied to the HDFS. While writing the query, we searched for all logs in which the supercomputer had failed. The query was executed in the MapReduce mode of PIG and was tested on the interactive grunt shell as well as by saving it as a script file. After the MapReduce functions were performed, these logs were stored in an output file in the HDFS for viewing.
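The single-pass search for failure records can be sketched as follows. This is a hedged illustration in Python rather than the authors' Java program or PIG query: the four columns (date, time, kernel, message) follow the description above, but the whitespace-separated layout, the keyword "fail", and all names (LogRecord, parse_line, find_failures) are assumptions.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class LogRecord(NamedTuple):
    """One log line split into the four assumed columns."""
    date: str
    time: str
    kernel: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Split a line into date, time, kernel and the remaining message text."""
    parts = line.strip().split(None, 3)
    if len(parts) < 4:
        return None  # line does not fit the assumed schema; skip it
    return LogRecord(*parts)


def find_failures(lines: Iterable[str], keyword: str = "fail") -> Iterator[LogRecord]:
    """Yield the records whose message mentions the keyword, ignoring case."""
    for line in lines:
        record = parse_line(line)
        if record is not None and keyword in record.message.lower():
            yield record
```

Because find_failures accepts any iterable of strings, it can be fed an open file object directly, so the whole log never has to be held in memory at once.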
We also ran user-defined queries, such as which activities were going on at a particular user-defined time or on a user-defined date. A screenshot of the statistics for a sample file is shown below (Fig. 4.2).

Fig. 4.2

As seen in the statistics screenshot, various measures are displayed, including the job_id, maps, reduces, maximum map time, minimum map time, average map time, median map time, maximum reduce time, minimum reduce time, average reduce time, median reduce time, aliases, features and outputs. Also, various counters are displayed, such as the total number of records written.

To view the output records of the query, we access the HDFS. Below (Fig. 4.3) is the output of the sample log file.

Fig. 4.3

In this paper, we have determined the efficiency of using PIG compared to that of a traditional Java program by taking log files of different sizes and analyzing them both ways. Each time, we doubled the size of the log file in order to get an exact picture as to which method is suitable for which kind of file (i.e. small, medium, large). The results of the performance check are given in the next section of the paper.

V. RESULT & ANALYSIS

On taking files of different sizes, we have observed that by using Hadoop and PIG, log files can be analyzed much more efficiently than with a traditional Java program. Not only does the time taken to analyze the file reduce considerably with this method, but programming with this method has also proven to be much simpler and more user-friendly. As Hadoop's architecture uses parallel processing, it is much more feasible for larger files. That is, to process larger files, more maps can be used in order to maintain efficiency. The graph (Fig. 5.1) was made by recording the time taken, in milliseconds, to process the logs of files of different sizes. This was done for both methods in our comparison.

Fig. 5.1 Performance Comparison: execution time (milliseconds) against file size

Below (Table 5.2) are the values taken to draw the graph. In the Hadoop method, a few initial values are unstable. This is probably due to the difference in number of maps allotted, the Hadoop environment and number of resources available. The traditional method shows no such inconsistency. For the largest file (1504 MB), the traditional program took 76638 ms against 807 ms for Hadoop.

Size (megabyte)    Traditional method (milli sec)    Hadoop (milli sec)
0.093              49                                12
0.18               70                                63
0.37               79                                49
0.74               106                               50
1.5                147                               58
2.9                328                               102
5.8                506                               57
11.6               901                               60
23.3               2093                              63
46.6               3458                              77
93                 8126                              80
188                18920                             133
376                22326                             324
752                36206                             437
1504               76638                             807

Table 5.2

VI. CONCLUSION

According to experimental results, the method of using the querying language PIG on Hadoop has proven to be scalable, reliable, faster and efficient. One of the major advantages besides efficiency is the fact that it is simple and easy to use, hence targeting a wider audience. The method reliably analyzes the log files, and hence proves that it is useful for structured as well as semi-structured data. Extensions of this paper could include a solution for analyzing completely unstructured data.

REFERENCES

[1]. T. White, Hadoop: The Definitive Guide. Yahoo Press, 2010.
[2]. Chuck Lam, Pig: Hadoop in Action.
[3]. J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Comm. of the ACM, Vol. 51, no. 1, pp. 107-113, 2008.
[4]. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, Pig Latin: A Not-So-Foreign Language for Data Processing, Proc. of the 2008 ACM SIGMOD International Conference on Management of Data, 2008, pp. 1099-1110.
[5]. Thomas Reidemeister, Mohammad Ahmad Munawar, Miao Jiang, Paul A.S. Ward, "Diagnosis of Recurrent Faults using Log Files," Proc. of the 2009 Conference of the Center for Advanced Studies on Collaborative Research, November 2009, pp. 12-23.
[6]. Apache. Hadoop: Open-source implementation of MapReduce. http://hadoop.apache.org.
[7]. Apache. Pig: High-level data flow system for Hadoop. http://www.pig.apache.org
[8]. Michael Cardosa, Chenyu Wang, Anshuman Nangia, Abhishek Chandra, Jon Weissman, "Exploring MapReduce efficiency with highly-distributed data," Proc. of the second international workshop on MapReduce and its applications, June 2011, pp. 27-34.
[9]. H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker, Map-reduce-merge: simplified relational data processing on large clusters, Proc. of the SIGMOD Conference, 2007, pp. 1029-1040.
[10]. A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava, Building a High-Level Dataflow System on Top of Map-Reduce: The Pig Experience. Proc. of the VLDB Endowment, vol. 2, no. 2, 2009.
[11]. A tutorial on pig: http://www.pig-tutorial.blogspot.in/