Getting help - guide to the ticketing system
Thomas Röblitz, UiO/USIT/UAV/ITF/FI ;)
The world is perfect, isn't it? Why do we need a ticket system?
Example 1
User: I am having trouble logging on to abel this morning. My username is, and I have access to a home area and the project area. Do you know if this is a general problem, and whether it will be solved soon?
Very good - Good - Bad
Example 1
User: I am having trouble logging on to abel this morning. My username is, and I have access to a home area and the project area. Do you know if this is a general problem, and whether it will be solved soon?
Very good - Good - Bad
Support: Hi, do you have a little more information? E.g., some error message...
User: -bash: warning: setlocale: LC_CTYPE: cannot change locale (UTF-8) and then I do not get a new bash prompt.
Example 1
User: -bash: warning: setlocale: LC_CTYPE: cannot change locale (UTF-8) and then I do not get a new bash prompt.
Support: It might be a problem with some environment variable on your computer. Please try the following:
    export LC_CTYPE=en_US.UTF-8
Then try to log in again.
User: It worked! Thanks a lot!
Timeline: user (0), support (+7 min), user (+6 min), support (+68 min), user (+45 min) => ~2 h
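The fix in Example 1 changes a single environment variable on the user's own computer. As a minimal sketch, assuming a bash shell, the locale-related variables could be listed like this so they can be pasted into the first message of a ticket. The helper name show_locale_vars is hypothetical and not part of any Abel tooling.

```shell
#!/bin/bash
# Hypothetical helper: print the locale-related environment variables
# that are currently set, sorted, one per line.
show_locale_vars() {
    env | grep -E '^(LANG|LANGUAGE|LC_[A-Z]+)=' | sort
}

# The value suggested by support in Example 1.
export LC_CTYPE=en_US.UTF-8

# Show what a program started from this shell would see.
show_locale_vars
```

Sending this output together with the error message might have saved one round trip in the timeline above.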
Example 2
User: I am experiencing severe problems in using XY and compiling files on Abel. I had a working setup on Titan but now it seems impossible to source and run XY on Abel.
Very good - Good - Bad
Example 2
User: I am experiencing severe problems in using XY and compiling files on Abel. I had a working setup on Titan but now it seems impossible to source and run XY on Abel.
Very good - Good - Bad
Support: What were you actually trying? E.g., which commands did you run, what is their output? How do you identify the problem?
User: never responded!
Timeline: user (0), support (+30 min), closed (+10 days)
Goals
- understand how we process tickets
- provide guidelines for the interactions and the information needed by the support team
- get the help needed
Abel (overview diagram)
- users: 875
- core support team: ~10 (tasks: tickets, development, projects); several other units involved
- compute nodes: Intel SNB, 2x 8 cores; 10K cores in total; hugemem nodes (1 TiB); GPU nodes (NVIDIA GPUs)
- interconnect: Mellanox FDR InfiniBand, 56 Gbps; IB + GbE
- storage: FhGFS global parallel file system, 400 TiB; scratch, /home, data (NorStore/UiOStore/local)
- SLURM resource manager; frontends for login, portals and management
- load: 300K jobs per month; ~130 software packages
- tickets (RT system): ~6K in 8 years, 800 for Abel
- also shown: new parallel file system, special-purpose nodes, I/O nodes, cloud, grid
Ticket stats
- ~6K HPC tickets, ~800 since Abel
- ~3 days (median) to process a ticket
How do we process a ticket?
How do we process a ticket?
Known/easy issue
- new user
- reset password
- program not found
- recently solved issue
- ...
Usually a short time to process.
How do we process a ticket?
Unknown/complex issue
- which parts of Abel are involved
- what actually happens or happened
To provide a (good) solution we usually want to reproduce what the user sees (or does not see).
This can be quite a long procedure.
Reproducing the problem
- try to run a minimal sequence of commands that leads to the problem
- verify that the problem exists
- understand the problem (better)
- adapt the environment, fix software packages, change parameters to provide a solution
- test the solution with the same sequence!
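One way to keep such a minimal sequence repeatable is to store the commands in a file and rerun them after every change. The following is only a sketch under that assumption; the function name run_sequence and the file name in the usage comment are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: run the commands listed in a file (one per line),
# echo each one before running it, and stop at the first one that fails.
run_sequence() {
    local file=$1 line
    # Read the file on descriptor 3 so the commands keep their normal stdin.
    while IFS= read -r line <&3; do
        [ -z "$line" ] && continue
        echo "+ $line"
        if ! bash -c "$line"; then
            echo "FAILED: $line"
            return 1
        fi
    done 3< "$file"
    echo "all commands succeeded"
}

# Usage (the file name is an assumption):
#   run_sequence repro-commands.txt
```

Rerunning the same file before and after a fix shows whether the failing step has moved or disappeared.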
Guidelines
- Is it a simple (UNIX) or generic issue? Try (1) Google, books; (2) colleagues; (3) houston
- Is it HPC (Abel) specific? Did you check our documentation? If that does not help, issue a ticket
Information for a new ticket
Some observations:
- often too little and too imprecise information
- (long) procedure to figure out what the core of the issue is
Remember: what, when, where, who, ...
Information for a new ticket
What?
- try to be as precise and expressive as possible (though this is not always possible)
- run commands with the tool "script" to generate a sequence of commands and outputs that leads to the problem
- either the root cause becomes obvious OR the issue can be reproduced
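The standard "script" tool records an interactive session: "script session.log" starts the recording and "exit" ends it. As a non-interactive alternative, the sketch below logs each command, its output and its exit status to a file. The function name run_logged and the default log file name are hypothetical.

```shell
#!/bin/bash
# Log file for the session; the default name is an assumption.
LOGFILE=${LOGFILE:-ticket-session.log}

# Hypothetical helper: run one command and append the command line, its
# combined stdout/stderr, and its exit status to $LOGFILE.
run_logged() {
    local status
    printf '$ %s\n' "$*" >> "$LOGFILE"
    if "$@" >> "$LOGFILE" 2>&1; then
        status=0
    else
        status=$?
    fi
    printf '[exit status: %d]\n' "$status" >> "$LOGFILE"
    return "$status"
}

# Usage:
#   run_logged uname -n
#   run_logged ls "$HOME/myrun"
# Then attach ticket-session.log to the ticket.
```

Recording the exit status alongside the output helps when a command fails silently.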
Information for a new ticket
Where?
- which Abel machine: login, compute, appnode, bioportal
- which remote machine: e.g., when logging in, which operating system (Win, OS X, Linux)
- which path, file: HOME/myrun/..., PROJECTS/..., WORK/...
Information for a new ticket
When?
- which day, time: yesterday, this morning, now, ...
- which job: job id(s)
Information for a new ticket
Who?
- myself
- myself + my colleague(s) {whom precisely?}
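Much of the where/when/who information can be collected with standard commands. This is a minimal sketch, assuming a bash shell; the function name ticket_context is hypothetical. SLURM_JOB_ID is the variable SLURM sets inside a running job, so outside a job the sketch prints "none".

```shell
#!/bin/bash
# Hypothetical helper: print basic context (who, where, when, which job)
# that can be pasted at the top of a new ticket.
ticket_context() {
    echo "Who:   $(id -un)"
    echo "Where: $(uname -n) ($(uname -s)), directory $(pwd)"
    echo "When:  $(date '+%Y-%m-%d %H:%M:%S %Z')"
    # SLURM_JOB_ID is only set inside a SLURM job.
    echo "Job:   ${SLURM_JOB_ID:-none}"
}

ticket_context
```

Run it on the machine where the problem occurs, since the host name and directory are taken from the current shell.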
Information for a new ticket
Other information / recommendations
- known previous/similar issue (refer to the ticket with URL/id)
- try to limit the scope of a single ticket
- do not reopen a resolved ticket with a follow-up issue
Interacting with hpc-drift
- Do not address a specific member of the support team (unless you know that person is the only one who can help...)
- Help them by providing additional information (sometimes timing is critical)
- Please, no URGENT, !!!, ??? ;)
- Try not to accuse other users of wrongdoing (nobody is perfect)
Resolved?
We are here to help you. Do not hesitate to ask!
Remember that we need your help too,
- to provide good solutions in reasonable time...
- to be pointed to issues we may overlook...