MSU Tier 3 Usage and Troubleshooting. James Koll



Similar documents
Dynamic Slot Tutorial. Condor Project Computer Sciences Department University of Wisconsin-Madison

Sawmill Log Analyzer Best Practices!! Page 1 of 6. Sawmill Log Analyzer Best Practices

Virtual server management: Top tips on managing storage in virtual server environments

Martinos Center Compute Clusters

Parallel Processing using the LOTUS cluster

UMass High Performance Computing Center

MONITORING PERFORMANCE IN WINDOWS 7

Windows Server Performance Monitoring

Fair Scheduler. Table of contents

HPC-Nutzer Informationsaustausch. The Workload Management System LSF

Chapter 2: Getting Started

Getting Started with SandStorm NoSQL Benchmark

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

External Sorting. Why Sort? 2-Way Sort: Requires 3 Buffers. Chapter 13

SQL Server Performance Tuning and Optimization

Deploying and Optimizing SQL Server for Virtual Machines

GraySort on Apache Spark by Databricks

Performance Tuning and Optimizing SQL Databases 2016

Microsoft SQL Server: MS Performance Tuning and Optimization Digital

Google File System. Web and scalability

Amadeus SAS Specialists Prove Fusion iomemory a Superior Analysis Accelerator

Top 10 reasons your ecommerce site will fail during peak periods

Managing your Domino Clusters

Cloud Server. Parallels. Key Features and Benefits. White Paper.

Data Compression in Blackbaud CRM Databases

University of Pennsylvania Department of Electrical and Systems Engineering Digital Audio Basics

Qsan Document - White Paper. Performance Monitor Case Studies

Maximizing SQL Server Virtualization Performance

Configuration Maximums VMware Infrastructure 3

WITH A FUSION POWERED SQL SERVER 2014 IN-MEMORY OLTP DATABASE

ENTERPRISE INFRASTRUCTURE CONFIGURATION GUIDE

ERserver. iseries. Work management

NetApp Data Compression and Deduplication Deployment and Implementation Guide

Understanding Hadoop Performance on Lustre

5 Reasons Your Business Needs Network Monitoring

Work Environment. David Tur HPC Expert. HPC Users Training September, 18th 2015

Tool - 1: Health Center

WHITE PAPER Keeping Your SQL Server Databases Defragmented with Diskeeper

Techniques for implementing & running robust and reliable DB-centric Grid Applications

13 Managing Devices. Your computer is an assembly of many components from different manufacturers. LESSON OBJECTIVES

Basic ShadowProtect Troubleshooting

System Architecture. CS143: Disks and Files. Magnetic disk vs SSD. Structure of a Platter CPU. Disk Controller...

Azure VM Performance Considerations Running SQL Server

Tuning Tableau Server for High Performance

Load Testing Hyperion Applications Using Oracle Load Testing 9.1

Five Reasons Your Business Needs Network Monitoring

Matlab on a Supercomputer

Characterizing Task Usage Shapes in Google s Compute Clusters

Assignment # 1 (Cloud Computing Security)

Preparing a SQL Server for EmpowerID installation

Windows Performance Monitor Troubleshooting Guide

An objective comparison test of workload management systems

A Survey of Shared File Systems

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database

LabStats 5 System Requirements

Understand Troubleshooting Methodology

Case Study I: A Database Service

Quick Start Guide.

White paper. QNAP Turbo NAS with SSD Cache

Product Review: James F. Koopmann Pine Horse, Inc. Quest Software s Foglight Performance Analysis for Oracle

Running a Workflow on a PowerCenter Grid

Big Fast Data Hadoop acceleration with Flash. June 2013

Maximizing VMware ESX Performance Through Defragmentation of Guest Systems. Presented by

How To Understand And Understand An Operating System In C Programming

VI Performance Monitoring

Outline. Failure Types

10 Tips for Optimizing the Performance of your Web Intelligence Reports. Jonathan Brown - SAP SESSION CODE: 0902

Apache Hadoop. Alexandru Costan

The Comeback of Batch Tuning

Load Testing Analysis Services Gerhard Brückl

EMC Unisphere for VMAX Database Storage Analyzer

Have both hardware and software. Want to hide the details from the programmer (user).

Performance White Paper

Analysis of VDI Storage Performance During Bootstorm

Understanding MySQL storage and clustering in QueueMetrics. Loway

Pcounter. Category Characteristics. Unified print room management Print policies and rules Product-based job processing Print queue management

Signiant Agent installation

Installing and Configuring Active Directory Agent

Deploying LoGS to analyze console logs on an IBM JS20

Using Parallel Computing to Run Multiple Jobs

Adaptive Resource Optimizer For Optimal High Performance Compute Resource Utilization

Operating Systems. and Windows

Performance And Scalability In Oracle9i And SQL Server 2000

WINDOWS SERVER MONITORING

IMPLEMENTING GREEN IT

Configuring Apache Derby for Performance and Durability Olav Sandstå

Best Practices for Monitoring Databases on VMware. Dean Richards Senior DBA, Confio Software

GRID workload management system and CMS fall production. Massimo Sgaravatto INFN Padova

How to Scale out SharePoint Server 2007 from a single server farm to a 3 server farm with Microsoft Network Load Balancing on the Web servers.

Low-cost BYO Mass Storage Project. James Cizek Unix Systems Manager Academic Computing and Networking Services

Transcription:

MSU Tier 3 Usage and Troubleshooting James Koll

Overview Dedicated computing for MSU ATLAS members Flexible user environment ~500 job slots of various configurations ~150 TB disk space 2

Condor commands `condor_q` : What jobs have been submitted from this machine? `condor_q global` : What jobs have been submitted to the cluster? `condor_q better-analyze [jobid]` : What is the status of this job and why? `condor_status` How many job slots are available? Read the user section of the manual: http://research.cs.wisc.edu/htcondor/manual/v8.0/2_users_ma nual.html 3

Condor Commands condor_q better-analyze That s why my job isn t running I asked for too much memory. 4

Job constraints Memory constraints Add to submit file: Request_memory = 4000 Memory* # of slots (approximate) 1GB 150 4GB 350 *How we define these slot types is malleable to user needs. 30 of the slots are available for quick jobs To use them, just add +MSU_QUEUE = short to the submit file prior to the queue statement. Jobs marked as short will still run on other slots if no short slots are available. 5

One line submit /msu/data/t3work1/scripts/runcommand.sh [Command you want to run] Runs as short queue Brings over environmental variables Prints location of further information, such as log files Prints output text to screen until job is finished Ctrl-C before job is finished will remove job from condor 6

Disk concurrency limits Disk I/O is tricky. There are voluntary parameters you can set to tell condor how many units of disk I/O your job uses. Instructions to add are on the wiki (link on last slide). It is good practice to include concurrency limits even if you aren t submitting very many jobs. Each disk has 10,000 units of disk I/O available. Unit is difficult to define. Random vs sequential operations Reads vs writes 7

Estimating concurrency limits Use this calculator to create a conservative estimate of how many units your job uses. http://hep.pa.msu.edu/concurrencycalc.html Probably okay to use this number by default if your jobs do not use a lot of I/O. If they do use a lot of I/O, you will want to optimize it. You can try decreasing the concurrency limit and see when you run into CPU wait problems. How to detect CPU wait is discussed in a few slides. 8

Improving disk performance There are tricks you can use to improve disk performance. Slim and skim datasets. Remember sequential reads/writes are faster than random ones. Stagger job submission times and have your jobs copy their input files to local scratch. Produce output files locally, then copy output to remote work directory. Split input files between many work disks. Use faster disks (t3fast vs t3work). 9

Green is slow Flexible resources => Lots of user freedom => We can run into problems If green is slow, there are a number of things you as a user can do to see what may be wrong. There are three things you can check: CPU Memory Disk I/O** (sort of) Once you identify the problem, you can identify the responsible user and contact them directly. 10

Troubleshooting CPU load Run `top`, then press shift-p to sort by CPU usage: koll is the bad user. Contact him and tell him to submit his CPU-intensive jobs to condor. 11

Troubleshooting memory usage Is there a memory problem? Run `free`: Run `top`, then press m to sort by memory usage: Not critically low, but lower than it should be Whoa, way too high! Tell this user to check for memory leaks and/or submit to condor! 12

Troubleshooting Disk I/O Easy to cause by accident. Check for wait I/O greater than a few percent. Too high. Look for a user with a large number of jobs running on the condor queue. If you can t identify the source, contact me. 13

Resources https://www.aglt2.org/wiki/bin/view/aglt2/msutier3 If you think of something that should be in the wiki, please add it. Contact me if you need an account. Please feel free to use my contact details, located on the wiki 14