Job and Machine Policy Configuration. HTCondor Week 2015 Todd Tannenbaum
1 Job and Machine Policy Configuration HTCondor Week 2015 Todd Tannenbaum
2 Policy Expressions
Policy expressions allow jobs and machines to restrict access, handle errors and retries, perform job steering, set limits, and control when/where jobs can start.
3 Assume a simple setup
Let's assume a pool with only a single user (me!). There are no user/group scheduling concerns; we'll get to that later.
4 We learned earlier
A job submit file can specify Requirements and Rank expressions to express constraints and preferences on a match:
    Requirements = OpSysAndVer == "RedHat6"
    Rank = KFlops
    Executable = matlab
    queue
Another set of policy expressions controls job status.
5 Job Status Policy Expressions
Users can supply job policy expressions in the job submit file (see the condor_submit man page). These expressions can reference any job ad attribute.
    on_exit_remove = <expression>
    on_exit_hold = <expression>
    periodic_remove = <expression>
    periodic_hold = <expression>
    periodic_release = <expression>
6 Job Status State Diagram
[Diagram: submit -> Idle (I) <-> Running (R) -> Completed (C), with transitions to Held (H) and Removed (X)]
7 Job Policy Expressions
Do not remove the job if it exits with a signal:
    on_exit_remove = ExitBySignal == False
Place on hold if the job exits with a nonzero status or ran for less than an hour:
    on_exit_hold = ( ExitCode =!= 0 ) || ( (time() - JobStartDate) < 3600 )
Place on hold if the job has spent more than 50% of its time suspended:
    periodic_hold = ( CumulativeSuspensionTime > (RemoteWallClockTime / 2.0) )
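Taken together, a complete submit file using these knobs might look like the sketch below; the executable name, retry limit, and release delay are illustrative assumptions, not from the slides:
    # Hypothetical submit file combining the job policy expressions above
    executable = my_sim
    log        = my_sim.log
    # Stay in the queue for another try if killed by a signal
    on_exit_remove = ExitBySignal == False
    # Go on hold if the job failed or finished suspiciously fast
    on_exit_hold = ( ExitCode =!= 0 ) || ( (time() - JobStartDate) < 3600 )
    # Auto-release held jobs after 30 minutes, allowing at most 3 starts
    periodic_release = (NumJobStarts < 3) && ((time() - EnteredCurrentStatus) > 1800)
    queue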
8 Job Policies by the Admin
Admins can also supply periodic job policy expressions in the condor_config file. These expressions impact all jobs submitted to a specific schedd.
    system_periodic_remove = <expression>
    system_periodic_hold = <expression>
    system_periodic_release = <expression>
What is the period? The frequency of evaluation is configurable via a floor (1 minute), a max (20 minutes), and a schedd timeslice (1%).
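As a sketch, a schedd-wide policy in condor_config might look like this; both thresholds are illustrative assumptions, not from the slides:
    # Hypothetical: remove jobs that have been in the queue over 30 days
    SYSTEM_PERIODIC_REMOVE = (time() - QDate) > (30 * 24 * 60 * 60)
    # Hypothetical: hold jobs whose memory image grew beyond ~8 GB
    # (ImageSize is in KB)
    SYSTEM_PERIODIC_HOLD = ImageSize > 8000000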
9 Startd Policy Expressions
How do you specify Requirements and Rank for machine slots? They are specified in condor_config. Machine slot policy (or "startd policy") expressions can reference items in either the machine ClassAd or the candidate job ClassAd (see the manual appendix for the list of attributes).
10 Administrator Policy Expressions
Some startd expressions (when to start/stop jobs):
    START = <expr>
    RANK = <expr>
    SUSPEND = <expr>
    CONTINUE = <expr>
    PREEMPT = <expr>   (really means evict)
And the related:
    WANT_VACATE = <expr>
11 Startd's START
START is the primary policy. When it is False, the machine enters the Owner state and will not run jobs. It acts as the Requirements expression for the machine: the job must satisfy START. It can reference job ClassAd values, including Owner and ImageSize.
12 Startd's RANK
Indicates which jobs a machine prefers. It is a floating point number, just like job rank, and larger numbers are higher ranked. It typically evaluates attributes in the job ClassAd, and typically uses + instead of &&. It is often used to give priority to the owner of a particular group of machines. Claimed machines still advertise, looking for a higher-ranked job to preempt the current job. LESSON: startd RANK creates job preemption.
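A minimal sketch of the owner-priority pattern, assuming the machines belong to a group whose members submit as users smith and jones (the names are illustrative):
    # Hypothetical: run anyone's jobs, but prefer (and preempt for) the owners
    START = True
    RANK = (Owner == "smith") + (Owner == "jones")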
13 Startd's PREEMPT
PREEMPT really means vacate ("I prefer nothing over this job!"). When PREEMPT becomes True, the job will be killed and go from Running back to Idle. The kill can be done nicely: if WANT_VACATE = <expr> is True, the startd sends a SIGTERM and follows up with a SIGKILL after MachineMaxVacateTime seconds.
Startd's SUSPEND and CONTINUE: when True, send SIGSTOP or SIGCONT (respectively) to all processes in the job.
14 Default Startd Settings
Always run jobs to completion:
    START = True
    RANK = 0
    PREEMPT = False
    SUSPEND = False
    CONTINUE = True
Or use the policy metaknob: use POLICY : ALWAYS_RUN_JOBS
15 Policy Configuration
"I am adding special new nodes, only for simulation jobs from Math. If none, simulations from Chemistry. If none, simulations from anyone."
16 Prefer Chemistry Jobs
    START = KindOfJob =?= "Simulation"
    RANK = 10 * (Department =?= "Math") + (Department =?= "Chemistry")
    SUSPEND = False
    PREEMPT = False
17 Policy Configuration
"Don't let any job run longer than 24 hrs, except Chemistry jobs can run for 48 hrs."
18 Settings showing runtime limits
    START = True
    RANK = 0
    PREEMPT = TotalJobRunTime > ifThenElse(Department =?= "Chemistry", 48 * (60 * 60), 24 * (60 * 60))
Note: this will result in the job going back to Idle in the queue to be rescheduled.
19 Runtime limits with a chance to checkpoint
    START = True
    RANK = 0
    PREEMPT = TotalJobRunTime > ifThenElse(Department =?= "Chemistry", 48 * (60 * 60), 24 * (60 * 60))
    WANT_VACATE = True
    MachineMaxVacateTime = 300
Wonder if the user will have any idea why their job was evicted...
20 Runtime limits with job hold
    START = True
    RANK = 0
    TIME_EXCEEDED = TotalJobRunTime > ifThenElse(Department =?= "Chemistry", 48 * (60 * 60), 24 * (60 * 60))
    PREEMPT = $(TIME_EXCEEDED)
    WANT_HOLD = $(TIME_EXCEEDED)
    WANT_HOLD_REASON = ifThenElse( Department =?= "Chemistry", "Chem job failed to complete in 48 hrs", "Job failed to complete in 24 hrs" )
21 The user sees the hold:
    C:\temp>condor_q
    -- Submitter: ToddsThinkpad : < :49748> : ToddsThinkpad
     ID   OWNER     SUBMITTED   RUN_TIME    ST PRI SIZE CMD
     1.0  tannenba  12/5 17:..  ..:00:03    H           myjob.exe
    1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended

    C:\temp>condor_q -hold
    -- Submitter: ToddsThinkpad : < :49748> : ToddsThinkpad
     ID   OWNER     HELD_SINCE  HOLD_REASON
     1.0  tannenba  12/6 17:29  Job failed to complete in 24 hrs
    1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
22 Could we implement this via job policy instead of startd policy?
Yes. Put in condor_config on the submit host:
    SYSTEM_PERIODIC_HOLD = (time() - JobStartDate) > (24*60*60)
    SYSTEM_PERIODIC_HOLD_REASON = "Job failed to complete in 24 hrs"
Which to use? You may only have control of one or the other. Startd policy is evaluated more often (every 5 secs). Consider whether the policy is associated with the job or with the machine, and keep the responsibilities of both in mind.
23 STARTD SLOT CONFIGURATION
24 Custom Attributes in Slot Ads
There are several ways to add custom attributes into your slot ads:
- From the config file(s) (for static attributes)
- From a script (for dynamic attributes)
- From the job ad of the job running on that slot
- From other slots on the machine
You can add a custom attribute to all slots on a machine, or only to specific slots.
25 Custom Attributes from the config file (static attributes)
Define your own slot attributes that jobs can match against. If you want the slot to have HasMatlab = true, define the value in config, and then add the name to STARTD_EXPRS (or STARTD_ATTRS), like this:
    HasMatlab = true
    STARTD_EXPRS = $(STARTD_EXPRS) HasMatlab
Or, for a specific slot only:
    SLOTX_HasMatlab = true
Note: there are also SUBMIT_EXPRS, SCHEDD_EXPRS, MASTER_EXPRS, ...
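On the job side, a submit file can then require the attribute; a minimal sketch (the executable name is illustrative):
    # Hypothetical submit file that only matches slots advertising HasMatlab
    executable = run_matlab
    requirements = (HasMatlab =?= true)
    queue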
26 Custom Attributes from a script (for dynamic attrs)
Run a script/program to define attributes. The script returns a ClassAd, and its attributes are merged into the slot ClassAds:
    STARTD_CRON_JOB_LIST = tag
    STARTD_CRON_tag_EXECUTABLE = detect.sh
The script can run once or periodically. Control which slots get the attributes with SlotMergeConstraint or SlotId.
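The slides do not show detect.sh itself; as a sketch (the path, period, and example attribute are assumptions), the cron job is wired up like this, with the script printing ClassAd attributes on stdout:
    # Hypothetical condor_config snippet: run detect.sh every 10 minutes
    STARTD_CRON_JOB_LIST = tag
    STARTD_CRON_tag_EXECUTABLE = /usr/local/bin/detect.sh
    STARTD_CRON_tag_PERIOD = 600
    # detect.sh would print ClassAd attributes on stdout, one per line,
    # e.g.:  HasScratch = True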
27 Custom attributes from the job ClassAd running on the slot
You can take attribute X from the job ad and publish it into the slot ad while the job is running on that slot. In condor_config:
    STARTD_JOB_EXPRS = $(STARTD_JOB_EXPRS) CanCheckpoint
Example submit file:
    executable = foo
    +CanCheckpoint = True
    queue
Now you can reference attributes of the job in machine policy and PREEMPTION_REQUIREMENTS, e.g.:
    PREEMPT = CanCheckpoint =?= True
28 Cross-publishing Slot Attributes
Policy expressions sometimes need to refer to attributes from other slots. Cross-publish with STARTD_SLOT_ATTRS:
    STARTD_SLOT_ATTRS = State, Activity
Now all slots can see Slot1_State, Slot2_State, ... Each slot's attributes are published in ALL slot ads as SlotN_X, so you can do this:
    START = $(START) && (SlotId == 2 && Slot1_State != "Claimed")
29 Default Behavior: One static slot per core
By default there is one static execution slot per CPU core, so each slot is a single-core slot. Other resources (Disk, Memory, etc.) are divided evenly among the slots. How can I customize this?
30 It is easy to lie!
Set arbitrary values for Cpus, Memory, etc., and HTCondor will allocate resources to slots as if the values were correct.
Default values:
    NUM_CPUS = $(DETECTED_CPUS)
    MEMORY = $(DETECTED_MEMORY)
Lying:
    NUM_CPUS = 99
    MEMORY = $(DETECTED_MEMORY)*99
31 Control how resources are allocated to slots
Define up to 10 slot types. For each slot type you can define:
- A name prefix for the slot (defaults to "slot")
- The number of slots of that type to create
- How many machine resources each slot should contain
- Policy knobs (START, PREEMPT, etc.) per slot type
32 Examples
Why? Perhaps to match your job load.
Each slot has two cores:
    NUM_SLOTS_TYPE_1 = $(DETECTED_CPUS)/2
    SLOT_TYPE_1 = Cpus=2
Non-uniform static slots: make one big slot, and put the rest in single-core slots:
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = Cpus=4, Memory=50%
    NUM_SLOTS_TYPE_2 = $(DETECTED_CPUS)-4
    SLOT_TYPE_2 = Cpus=1
33 [Aside: Big slot plus small slots]
How to steer single-core jobs away from the big multi-core slots:
Via the job ad (if you control the submit machine):
    DEFAULT_RANK = RequestCpus - Cpus
Via the condor_negotiator on the central manager:
    NEGOTIATOR_PRE_JOB_RANK = RequestCpus - Cpus
34 Another why: Special Purpose Slots
A slot for a special purpose, e.g. data movement, maintenance, interactive jobs, etc.
    # Lie about the number of CPUs
    NUM_CPUS = $(DETECTED_CPUS)+1
    # Define standard static slots
    NUM_SLOTS_TYPE_1 = $(DETECTED_CPUS)
    # Define a maintenance slot
    NUM_SLOTS_TYPE_2 = 1
    SLOT_TYPE_2 = cpus=1, memory=1000
    SLOT_TYPE_2_NAME_PREFIX = maint
    SLOT_TYPE_2_START = Owner == "tannenba"
    SLOT_TYPE_2_PREEMPT = False
35 condor_status now shows the extra slot:
    C:\home\tannenba>condor_status
    Name  OpSys    Arch    State      Activity  LoadAv  Mem  ActvtyTime
    ...   WINDOWS  X86_64  Unclaimed  Idle                   0:00:08
    ...   WINDOWS  X86_64  Unclaimed  Idle                   0:00:04
    ...   WINDOWS  X86_64  Unclaimed  Idle                   0:00:05
    ...   WINDOWS  X86_64  Unclaimed  Idle                   0:00:06
    ...   WINDOWS  X86_64  Unclaimed  Idle                   0:00:07
    [totals summary for X86_64/WINDOWS follows]
36 Defining a custom resource
Define a custom STARTD resource:
    MACHINE_RESOURCE_<tag>
It can also come from a script (if you want dynamic discovery):
    MACHINE_RESOURCE_INVENTORY_<tag>
<tag> is case preserving, case insensitive.
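A minimal sketch of both forms (the tag, count, and script path are illustrative assumptions):
    # Hypothetical: a fixed quantity of a custom resource named Cogs
    MACHINE_RESOURCE_Cogs = 4
    # Hypothetical: discover the quantity at startup via a script that
    # prints a ClassAd attribute such as:  DetectedCogs = 4
    MACHINE_RESOURCE_INVENTORY_Cogs = /usr/local/bin/count_cogs.sh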
37 Fungible resources
For OS-virtualized resources: Cpus, Memory, Disk. For intangible resources: bandwidth, licenses? Works with static and partitionable slots.
38 Fungible custom resource example: bandwidth (1)
    > condor_config_val -dump Bandwidth
    MACHINE_RESOURCE_Bandwidth = 1000
    > grep -i bandwidth userjob.submit
    REQUEST_Bandwidth =
39 Fungible custom resource example: bandwidth (2)
Assuming 4 static slots:
    > condor_status -long | grep -i bandwidth
    Bandwidth = 250
    DetectedBandwidth = 1000
    TotalBandwidth = 1000
    TotalSlotBandwidth =
40 Non-fungible resources
For resources not virtualized by the OS: GPUs, instruments, directories. Configure by listing resource ids; the quantity is inferred. Specific id(s) are assigned to slots. Works with static and partitionable slots.
41 Non-fungible custom resource example: GPUs (1)
    > condor_config_val -dump gpus
    MACHINE_RESOURCE_GPUs = CUDA0, CUDA1
    ENVIRONMENT_FOR_AssignedGPUs = GPU_NAME GPU_ID=/CUDA//
    ENVIRONMENT_VALUE_FOR_UnAssignedGPUs =
    > grep -i gpus userjob.submit
    REQUEST_GPUs = 1
Or use the metaknob: use feature : GPUs
42 Non-fungible custom resource example: GPUs (2)
    > condor_status -long slot1 | grep -i gpus
    AssignedGpus = "CUDA0"
    DetectedGPUs = 2
    GPUs = 1
    TotalSlotGPUs = 1
    TotalGPUs = 2
43 Non-fungible custom resource example: GPUs (3)
Environment of a job running on that slot:
    > env | grep -i CUDA
    _CONDOR_AssignedGPUs = CUDA0
    GPU_NAME = CUDA0
    GPU_ID = 0
44 Non-uniform static slots help, but are not perfect
[Diagram: an 8 GB machine partitioned into 5 static slots, one 4 GB slot and four 1 GB slots; a 4 GB job matches the 4 GB slot]
45 [Diagram: the same 8 GB machine partitioned into 5 slots, now running only small 1 GB jobs, including one on the 4 GB slot]
46 [Diagram: with the 4 GB slot occupied by a small job, 7 GB is free overall, yet the 4 GB job sits idle because no single slot can hold it]
47 Partitionable Slots: The big idea
One parent partitionable slot, from which child dynamic slots are carved at claim time. When dynamic slots are released, their resources are merged back into the partitionable parent slot.
48 (cont)
Partitionable slots split on Cpus, Disk, and Memory (plus any custom startd resources you defined). When you are out of Cpus or Memory, you are out of slots.
49 Three types of slots
Static (the usual kind). Partitionable (the leftovers). Dynamic (claimed slots carved off a partitionable slot parent): created dynamically at claim time, but once created they act as static; when unclaimed, their resources go back to the partitionable parent.
50 [Diagram: an 8 GB partitionable slot with a 4 GB dynamic slot carved off]
51 [Diagram: the same 8 GB partitionable slot with 5 GB now carved off into dynamic slots]
52 How to configure
    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = cpus=100%
    SLOT_TYPE_1_PARTITIONABLE = true
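On the submit side, each job's request_* values determine how big a dynamic slot gets carved off; a minimal sketch (the executable and sizes are illustrative):
    # Hypothetical submit file; the dynamic slot is sized from these requests
    executable = my_analysis
    request_cpus = 2
    # request_memory is in MB, request_disk in KB
    request_memory = 4096
    request_disk = 10000000
    queue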
53 Looks like:
    $ condor_status
    Name       OpSys  Arch    State      Activity  LoadAv  Mem  ActvtyTime
    slot1@c... LINUX  X86_64  Unclaimed  Idle
    [totals summary for X86_64/LINUX follows]
54 When running:
    $ condor_status
    Name         OpSys  Arch    State      Activity  LoadAv  Mem  ActvtyTime
    slot1@c...   LINUX  X86_64  Unclaimed  Idle
    slot1_1@c... LINUX  X86_64  Claimed    Busy
    slot1_2@c... LINUX  X86_64  Claimed    Busy
    slot1_3@c... LINUX  X86_64  Claimed    Busy
55 Can specify default Request values
    JOB_DEFAULT_REQUEST_MEMORY = ifThenElse(MemoryUsage =!= UNDEFINED, MemoryUsage, 1)
    JOB_DEFAULT_REQUEST_CPUS = 1
    JOB_DEFAULT_REQUEST_DISK = DiskUsage
56 Fragmentation
    Name  OpSys  Arch    State      Activity  LoadAv  Mem
    ...   LINUX  X86_64  Unclaimed  Idle
    ...   LINUX  X86_64  Claimed    Busy
    ...   LINUX  X86_64  Claimed    Busy
    ...   LINUX  X86_64  Claimed    Busy
Now I submit a job that needs 8G. What happens?
57 Solution: Draining
condor_drain and condor_defrag. One daemon defrags the whole pool; the central manager is a good place to run it. It scans the pool and tries to fully defrag some startds by invoking condor_drain to drain the machine. It only looks at partitionable machines. The admin picks some % of the pool that is allowed to be whole. Note: some heuristics can reduce the need to defrag, such as packing large jobs on high-numbered nodes and small jobs on low-numbered nodes.
58 Oh, we got knobs
    DEFRAG_DRAINING_MACHINES_PER_HOUR  (default is 0)
    DEFRAG_MAX_WHOLE_MACHINES  (default is -1)
    DEFRAG_SCHEDULE:
        graceful (obey MaxJobRetirementTime; the default)
        quick (obey MachineMaxVacateTime)
        fast (hard-kill immediately)
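A minimal condor_defrag configuration sketch for the central manager; the rates and counts are illustrative, not recommendations:
    # Hypothetical: enable the defrag daemon on the central manager
    DAEMON_LIST = $(DAEMON_LIST) DEFRAG
    # Drain at most one machine per hour
    DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
    # Stop draining once 8 machines are whole
    DEFRAG_MAX_WHOLE_MACHINES = 8
    # Let running jobs retire gracefully while draining
    DEFRAG_SCHEDULE = graceful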
59 Future work on Partitionable Slots
See my talk from Wednesday!
60 Further Information
For further information, see section 3.5 "Policy Configuration for the condor_startd" in the Condor manual, the Administrator HOWTOs on the Condor Wiki, and the condor-users mailing list.