November 13, 2014
Bryan Caron
bryan.caron@mcgill.ca
bryan.caron@calculquebec.ca
McGill University / Calcul Québec / Compute Canada
Montréal, QC, Canada

Outline
- Compute Canada News
- October Service Interruption De-Brief
- November GPFS Online Maintenance
- Scheduler Updates
- Software and User Environment Updates
- Training News

Compute Canada News
- Resource Allocation Competition 2015
  - RAC and Research Platforms & Portals (RPP)
  - Application submission deadline: October 23
  - Results announcement: December 2
- Call for Researcher Visualizations
  - 2D or 3D visualizations from any research area that has leveraged Compute Canada resources
  - Interactive or pre-recorded movies
  - Contact us: guillimin@calculquebec.ca
- Stay tuned for exciting Compute Canada announcements on December 2

Service Interruption De-Brief
- Guillimin Service Interruption: October 17-18
  - Scheduled outage due to a full ETS campus-wide power interruption for electrical maintenance
  - Start: Friday October 17, Target End: Saturday October 18
  - Storage system and network maintenance
  - Power outage (23h - 03h)
  - All Guillimin services restored October 19 (afternoon)
- October 20: Write time-out errors observed on GSS nodes of GPFS cluster (/gs/scratch and /gs/project)
  - Root cause: enclosure firmware & GPFS version mismatch
- October 21: Halt of all user access to stop GPFS and update firmware on all enclosures
  - Access re-opened evening of October 21

GPFS Maintenance
- GPFS Maintenance - November 6
  - Online replacement of one faulty disk drive enclosure drawer
  - Live drawer replacement validated by testing under load on our most recent GPFS Storage System, which is not yet in active production
  - Impacted filesystems: /gs/project/ and /gs/scratch/
  - GPFS slowdown during the brief drawer replacement
  - File integrity maintained during drawer replacement
  - Maximum performance and stability restored
- Note: /scratch auto-clean-up re-scheduled to November 20

Scheduler Update
- In general, improved overall stability and performance
- Recall:
  - April 10 - qsub for job submission enabled
  - November 15 - msub submissions will be disabled
- qsub provides better support for queue options and speed of submissions
- Please ensure any scripts are using the site Torque qsub instead of msub
  - /opt/torque/x86_64/bin/qsub
- Torque equivalents of Moab commands
  - Old: canceljob   New: qdel
  - Old: checkjob    New: qstat -f
  - Old: showq       New: qstat
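
For users with many submission or monitoring scripts, the Moab-to-Torque command mapping above can be checked mechanically. The following is a minimal sketch, not a site-provided tool: the function name `find_moab_commands` and the comment-stripping rule are illustrative assumptions, and it only reports command names. Options and arguments may differ between the Moab and Torque commands and should be reviewed by hand.

```python
import re

# Moab -> Torque command names, taken from the list on this slide.
MOAB_TO_TORQUE = {
    "msub": "qsub",
    "canceljob": "qdel",
    "checkjob": "qstat -f",
    "showq": "qstat",
}

# Whole-word match (\b), so names such as "msub_wrapper" are not flagged.
_PATTERN = re.compile(r"\b(" + "|".join(MOAB_TO_TORQUE) + r")\b")


def find_moab_commands(script_text):
    """Return a list of (line_number, moab_command, torque_equivalent).

    Hypothetical helper for illustration only. Text after the first '#'
    on a line is ignored; this is a crude way to skip shell comments and
    also skips '#PBS' directive lines.
    """
    hits = []
    for lineno, line in enumerate(script_text.splitlines(), start=1):
        code = line.split("#", 1)[0]
        for match in _PATTERN.finditer(code):
            moab = match.group(1)
            hits.append((lineno, moab, MOAB_TO_TORQUE[moab]))
    return hits


if __name__ == "__main__":
    example = "msub job.sh\n# checkjob 123\nshowq -u $USER\ncanceljob 42\n"
    for lineno, moab, torque in find_moab_commands(example):
        print(f"line {lineno}: replace '{moab}' with '{torque}'")
```

The sketch reports matches rather than rewriting files, so each replacement can be confirmed before a script is changed.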

Software Update
- New Installations
  - HDF5/1.8.13-intel
  - OpenFOAM/2.2.2-GCC-OpenMPI
  - SAS 9.3 (Licensed for McGill users only)
  - Matlab MDCS version R2014b
- New Matlab MDCS integration scripts
  - Version 2.0
  - Easier configuration of job parameters (ppn, pmem, gpus, etc.)
  - Please see Parallel Matlab on Guillimin on our Documentation page for details (http://www.hpc.mcgill.ca/index.php/starthere)

Software Update
- Reminder: Guillimin Hadoop Cluster
  - 10 nodes available for MapReduce / Hadoop workloads
  - In-progress updates:
    - Major increase to storage pool size per node using GPFS
    - Hadoop ecosystem component installation based upon Hortonworks Data Platform (HDP)
  - Please contact guillimin@calculquebec.ca for access
- Hadoop Talk @ Big Data Montréal - November 4
  - By Dan Mazur of McGill HPC / Calcul Québec
  - http://www.bigdatamontreal.org/

Training News
- See Training at www.hpc.mcgill.ca for our full calendar of training and workshops for 2014 and to register
  - All materials from previous workshops are available online
  - Suggestions for training in 2015? Please let us know!
- Upcoming:
  - November 27 - Introduction to Xeon Phi
  - December 4 - Introduction to GPU / CUDA
  - December 11 - Introduction to Matlab Distributed Computing Server
- Recently Completed:
  - November 6 - Advanced MPI
  - October 23 - Introduction to OpenMP
  - October 9 - Introduction to MPI

User Feedback and Discussion
- Questions? Comments? We value your feedback.
- Guillimin Operational News for Users
  - Status Pages
    - http://www.hpc.mcgill.ca/index.php/guillimin-status
    - http://serveurscq.computecanada.ca (all CQ systems)
  - Follow us on Twitter
    - http://twitter.com/mcgillhpc