Batch Job Analysis to Improve the Success Rate in HPC




JunWeon Yoon (1), TaeYoung Hong (2), ChanYeol Park (3), HeonChang Yu (4)
1 First Author: KISTI and Korea University, jwyoon@kisti.re.kr
2, 3 KISTI, tyhong@kisti.re.kr, chan@kisti.re.kr
4 Corresponding Author: Korea University, yuhc@korea.ac.kr

Abstract

Tachyon is a supercomputer built by SUN (Oracle) at KISTI for use in a variety of science applications. It is composed of 3,200 computing nodes plus infrastructure facilities and performs at a theoretical peak of 300 teraflops. The machine also runs a range of supporting software: a parallel file system, compilers, debuggers, parallel tools, and more. As its batch scheduler, Tachyon uses Sun Grid Engine (SGE) to run the cluster's batch jobs. In this paper, we analyze the batch job logs that record the history of operations performed by SGE. In particular, we focus on isolating the failed job logs to find the causes of failure, and we further separate failures caused by user actions from those caused by system errors. As a first step, the failure log itself can be validated, and some classes of failure can be blocked in advance. Users can then recognize a problem immediately instead of waiting unnecessarily in the queue, and from the scheduler's point of view the overall waiting time is reduced.

Keywords: HPC, Supercomputer, Batch job, Scheduler, SGE

1. Introduction

Tachyon is a high-performance parallel computing system built on the SUN Blade 6275. It is composed of 3,200 computing nodes and supporting infrastructure such as login nodes, a scheduler server, storage, and an archiving system. The system runs RedHat 5.3 and uses Lustre [1] as its file system, with nodes connected by InfiniBand [2]. Figure 1 shows an overview of the Tachyon system [3]. Tachyon uses Sun Grid Engine (hereinafter "SGE") as its distributed resource management scheduler [4].
When a user submits a job to the cluster, the qmaster in SGE sorts the jobs in order of priority [5]. Job priority is derived from the scheduler policies [6], and the sorted list of pending jobs is assigned to job slots (CPUs) in priority order [7]. In this paper, we classify job logs by the SGE exit code, which records why a job stopped abnormally. Using the exit code value, we separate the reasons jobs ended, pick out the causes of failure, and prevent some jobs from failing in advance.

Figure 1. Summary of the Tachyon system

Journal of Next Generation Information Technology (JNIT), Volume 4, Number 8, October 2013
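Since the analysis below relies on reading a job's final status out of the SGE accounting record, the idea can be sketched as follows. `qacct -j` and its `failed` and `exit_status` fields are real SGE accounting output, but the record below is a trimmed, hand-made sample with invented values:

```shell
#!/bin/sh
# Illustrative only: a trimmed, hand-made sample of `qacct -j` output.
# The field names `failed` and `exit_status` are real SGE accounting fields;
# the values are invented for this example.
sample_record='qname        normal
jobnumber    1310609
failed       0
exit_status  137'

# Pull out the two fields that tell us how the job ended.
failed=$(printf '%s\n' "$sample_record" | awk '$1 == "failed" { print $2 }')
exit_status=$(printf '%s\n' "$sample_record" | awk '$1 == "exit_status" { print $2 }')

echo "failed=$failed exit_status=$exit_status"
```

A `failed` value of 0 with a non-zero `exit_status` means SGE started the job normally but the job itself returned a non-zero status.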

2. Batch Job Processing

2.1. Job execution log on SGE

Put simply, a scheduler manages and controls batch jobs so that limited resources can be shared [8]. Many such products exist, including SGE, Torque, PBS, LoadLeveler, and others. As noted above, the Tachyon system uses SGE as its batch job scheduler. Figure 2 shows the overall composition of queues, slots, hosts, and host groups in SGE. A user can submit many jobs without having to worry about where they will execute [9]. Hosts can be grouped into host groups as needed. A job slot is the minimum resource unit and is regarded as one CPU (core); queues own job slots, and the sorted list of pending jobs is assigned to job slots in order of priority [10].

SGE also exposes job execution information through various commands. For example, running qacct -j #Job_ID after a job has ended prints its execution log, as in Figure 3. This log contains basic execution data such as the queue type, job id, submit time, start time, end time, job status, and so on.

Figure 2. Host Group, Queue and Job Slots
Figure 3. Executed job information from SGE

2.2. Analysis of job logs

In this paper, we analyzed the job logs for the whole of 2012. Based on the output shown in Figure 3, the information for each executed job is stored again, as in Table 1, after the job completes on Tachyon; we call this the Converted-log. It records not only how a job executed but also why it terminated. In particular, the last two fields (No. 19 and 20 in Table 1) show whether the job finished normally. The causes of abnormally ended jobs are quite diverse: forced termination by the user, job submit script errors, exceeding the scheduler's wall time limit [11], software and hardware trouble, and so on. Each reason has a code value, as shown in Table 2.

Table 1. Information of executed job log

No.  Property        Value (example)    No.  Property     Value (example)
1    DATE            20121029           11   RUN(s)       35852
2    JOBID           1310609            12   CPUS         80
3    GID             na0***             13   CPU USAGE    2834282
4    UID             r000***            14   MEM USAGE    68
5    JOBNAME         jj_007_***         15   MAXVMEM      1876
6    QNAME           ocean4special      16   STATUS       D
7    SUBMIT(DATE)    20121026111256     17   E-CPU        80
8    START(DATE)     20121028191152     18   E-RUN(s)     2868160
9    END(DATE)       20121029050924     19   EXIT-CODE    0(11)
10   WAIT(s)         201536             20   FAILED       0(11)
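Because the Converted-log fields of Table 1 are whitespace-separated, the two failure-related fields can be pulled out with a one-line awk filter. This is a minimal sketch: the record below is invented, and the exact on-disk format of the Converted-log is an assumption based on Table 1's field order:

```shell
#!/bin/sh
# A minimal sketch: parse one invented Converted-log record whose
# whitespace-separated fields follow the order of Table 1, and print the
# fields that matter for failure analysis: JOBID (field 2), EXIT-CODE
# (field 19), and FAILED (field 20).
line='20121029 1310609 na0 r000 jj_007 ocean4special 20121026111256 20121028191152 20121029050924 201536 35852 80 2834282 68 1876 D 80 2868160 0 0'

echo "$line" | awk '{ printf "jobid=%s exit_code=%s failed=%s\n", $2, $19, $20 }'
```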

Table 2. qacct -j "failed" field codes (from SGE)

Code   Description                        Meaning for job
0      No failure                         Job ran, exited normally
1~11   Presumably before job; before      Job could not be started
       writing config; before writing
       PID; on reading config file;
       setting processor set; before
       prolog; in prolog; before
       prestart; in prestart; before job
12     Before pestop                      Job ran, failed before calling PE stop procedure
13     In pestop                          Job ran, PE stop procedure failed
14     Before epilog                      Job ran, failed before calling epilog script
15     In epilog                          Job ran, failed in epilog script
16     Releasing processor set            Job ran, processor set could not be released
24     Migrating (checkpointing jobs)     Job ran, job will be migrated
25     Rescheduling                       Job ran, job will be rescheduled
26     Opening output file                Job could not be started, stderr/stdout file could not be opened
27     Searching requested shell          Job could not be started, shell not found
28     Changing to working directory      Job could not be started, error changing to start directory
100    Assumedly after job                Job ran, job killed by a signal

If the code is 0, the job ended normally. For jobs that run successfully, the output of qacct -j shows a value of 0 in the failed field and the job's exit status in the Exit_status field; for failed jobs, the failed field shows one of the code values listed in Table 2. Using the Converted-log, we extracted the exit codes and counted the occurrences of each failure type. Figure 4 shows the script used to tally the codes. We then categorized the causes by failure type; Table 3 lists the major failed codes observed on the Tachyon system. Some failure reasons are hard to track down.

codefile=$(cat 2012*.sge | awk '{ print $2,$19,$20 }' > sge.exit.code.${today})
# Extract job_id and error codes from the Converted-log into a temporary file.
cat sge.exit.code.${today} | awk '{ print $2,$1 }' | awk '{ print $2 }' > sge.exit.code.errors.jobid.${today}
cat sge.exit.code.${today} | awk '{ print $2,$1 }' | egrep "^[1-9]" | awk '{ print $2 }' > ${tmpdir}/sge.exit.code.errors.jobid.${today}
# Extract failed jobs (non-zero code values)
errorjobnum=$(wc -l sge.exit.code.errors.jobid.${today} | awk '{ print $1 }')   # number of failed jobs
totaljobnum=$(wc -l sge.exit.code.${today} | awk '{ print $1 }')                # total jobs
successjobnum=$(expr $totaljobnum - $errorjobnum)                               # number of successful jobs
errrate=$(awk -v x=$errorjobnum -v y=$totaljobnum 'BEGIN { printf "%.3f\n", x/y*100 }')
succrate=$(awk -v z=$errrate 'BEGIN { printf "%.3f\n", 100-z }')
echo "Number of Total Jobs = $totaljobnum, Number of Success Jobs = $successjobnum, Number of Failed Jobs = $errorjobnum" > error_rate_${datamonth}.log
echo "Failed Rate = ($errorjobnum/$totaljobnum) * 100" >> $tmpdir/error_rate_${datamonth}.log

Total Job Number = 564746, Failed Job Number = 104182
Failed Rate = (104182/564746) * 100
Job Failed Rate : 18.448 (81.552) %

Figure 4. A script to separate reasons using failed codes (in 2012)
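As a quick sanity check of the numbers the Figure 4 script prints, the reported rate follows directly from the two counts (564,746 total jobs, 104,182 failed):

```shell
#!/bin/sh
# Recompute the failure rate reported in Figure 4 from its two raw counts:
# 104,182 failed jobs out of 564,746 total jobs.
awk 'BEGIN {
    total = 564746; failed = 104182
    printf "Job Failed Rate : %.3f (%.3f) %%\n", failed/total*100, 100 - failed/total*100
}'
```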

Table 3. Failed codes from the Tachyon system

Value                  Reason                                                        Times    Rate
0(*),100(*) or 100(*)  Job ran, job killed by a signal                               72,013   12.7%
0(*),13(*) or 13(*)    Job ran, PE stop procedure failed                             18,673   3.3%
27(*)                  Job could not be started, shell not found                     495      0.1%
0(*),28(*) or 28(*)    Job could not be started, error changing to start directory   458      0.1%

3. Strategies

With the results of this analysis, we can take steps for each failed code. As noted earlier, the codes record why a job failed; typical cases include job submit script errors, wrong commands, floating point exceptions, references to invalid memory addresses, and so on. After analyzing the failed codes in the Converted-log, some types of failure can be prevented before the user's job runs. To block these user errors, we modify the pre- and post-processing in SGE's prolog and epilog, which are global scripts invoked before and after any job. For example, a job submit script error is one of the basic causes of failure: exceeding a time limit or the maximum number of submitted jobs, a wrong command, parallel environment errors, and the like. In such cases we can filter the error beforehand with the prolog script instead of waiting for the job to run. Figure 5 shows a prolog script that filters out jobs whose submit scripts violate the queue's conditions. In addition, some jobs complete successfully but still leave a failed log; we modified the epilog script so that such results are reflected accurately.

normal )  # Queue name
    # If the job does not fit the queue properties, warn the user and delete the job.
    if [ $total_slots -lt 17 ]; then        # minimum number of slots for the queue
        echo "[ERROR] The CPU number of your job should be greater than or equal to 17"
        ${SSH} $SGE_O_HOST "echo '[ERROR] The CPU number of your job should range from 1 to 960 in exclusive2 queue.' | write $USER $SSH_TTY"
        ${QDEL} $JOB_ID
        exit 1
    elif [ $total_slots -gt 1568 ]; then    # maximum number of slots for the queue
        echo "[ERROR] The CPU number of your job should be less than or equal to 1536"
        ${QDEL} $JOB_ID
        exit 1
    fi
    if ( isdigit $hrt ) && [ $hrt -gt 48 ]; then   # wall time limit (h_rt)
        echo "[ERROR] The wall time limit (h_rt) for normal queue is 48:00:00"
        ${SSH} $SGE_O_HOST "echo '[ERROR] The wall time limit (h_rt) for normal queue is 48:00:00' | write $USER $SSH_TTY"
        ${QDEL} $JOB_ID
        exit 1
    fi
    ;;

Figure 5. Prolog script for prevention of user's job script errors
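The paper does not show the modified epilog, so the following is only a hypothetical sketch of the reclassification it describes: if the job itself exited cleanly but SGE recorded failed code 13 (PE stop procedure failed), count the job as a success. The `reclassify` helper and its two arguments are invented for illustration; a real epilog would read these values from the job's accounting record:

```shell
#!/bin/sh
# Hypothetical epilog fragment (not the authors' actual script): decide how a
# finished job should be recorded, given its own exit status and the SGE
# "failed" code. Both arguments are invented stand-ins for values the real
# epilog would obtain from the accounting record.
reclassify() {
    job_exit_status=$1
    sge_failed_code=$2
    if [ "$sge_failed_code" -eq 0 ]; then
        echo "success"                       # SGE saw no failure at all
    elif [ "$job_exit_status" -eq 0 ] && [ "$sge_failed_code" -eq 13 ]; then
        echo "success"                       # job ran cleanly; only the PE stop procedure failed
    else
        echo "failed"
    fi
}

reclassify 0 13     # job finished cleanly, only the PE stop procedure failed
reclassify 137 100  # job killed by a signal
```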

4. Simulation

Following the steps above, we carried out a simulation. In practice, several failed codes do not clearly indicate the reason for failure. As described, we fixed the prolog/epilog scripts and corrected how wrong job execution information is reflected. Figure 6 shows the job execution rate in 2012. We first applied the treatment to failed codes 13, 27, and 28. Figure 7 shows the simulated result after the pre- and post-processing: some errors are prevented in advance, and the average job success rate improves from 77.0% to 80.2%.

Figure 6. Job execution rate
Figure 7. Fixed job execution rate

5. Conclusion and Future Work

In this paper, we analyzed the actual job execution logs of the Tachyon system using the Converted-log. For this work, we had to ferret out the reasons for failure by unpacking the failed codes from SGE; these codes express numerically why a job ended abnormally. Some failure cases can be blocked in advance by pre- and post-processing, so we fixed the prolog and epilog scripts in SGE. As a

result, users can find out about problems in their job scripts before the job executes, and the overall job waiting time is reduced. As noted, the cause of failure is difficult to determine from some failed codes. As a future step, more research is needed on the relationship between SGE failed codes and Linux signals, which should further improve the job success rate.

6. References

[1] F. Wang, S. Oral, G. Shipman, O. Drokin, T. Wang, and I. Huang, "Understanding Lustre Filesystem Internals", Oak Ridge National Laboratory, Technical Report ORNL/TM-2009/117, 2009.
[2] G. Pfister, "An Introduction to the InfiniBand Architecture" (http://www.infinibandta.org/), IEEE Press, 2001.
[3] National Institute of Supercomputing and Networking, KISTI, http://www.nisn.re.kr/
[4] D. Templeton, "A Beginner's Guide to Sun Grid Engine 6.2", Sun Microsystems whitepaper, July 2009.
[5] C. Chaubal, "Scheduler Policies for Job Prioritization in the Sun N1 Grid Engine 6 System", Sun BluePrints Online, Sun Microsystems, Inc., Santa Clara, CA, USA, http://www.sun.com/blueprints/1005/819-4325.pdf, 2005.
[6] J. H. Abawajy, "An Efficient Adaptive Scheduling Policy for High-Performance Computing", Future Generation Computer Systems, Volume 25, Issue 3, pp. 364-370, March 2009.
[7] G. Cawood, T. Seed, R. Abrol, T. Sloan, "TOG & JOSH: Grid Scheduling with Grid Engine & Globus", Proceedings of the UK e-Science All Hands Meeting, Nottingham, 2004.
[8] M. Stillwell, F. Vivien, H. Casanova, "Dynamic Fractional Resource Scheduling versus Batch Scheduling", IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 3, pp. 521-529, March 2012.
[9] S. Iqbal, R. Gupta, Y.-C. Fang, "Job Scheduling in HPC Clusters", Dell Power Solutions, pp. 133-136, 2005.
[10] J. Stosser, P. Bodenbenner, S. See, D. Neumann, "A Discriminatory Pay-as-Bid Mechanism for Efficient Scheduling in the Sun N1 Grid Engine", Proceedings of the 41st Annual Hawaii International Conference on System Sciences, p. 382, 2008.
[11] R. Kumar, S. Vadhiyar, "Identifying Quick Starters: Towards an Integrated Framework for Efficient Predictions of Queue Waiting Times of Batch Parallel Jobs", Job Scheduling Strategies for Parallel Processing, Springer Berlin Heidelberg, pp. 196-215, 2013.