How to control Resource allocation on pSeries multi-MCM systems




How to control Resource allocation on pSeries multi-MCM systems
Pascal Vezolle, Deep Computing EMEA ATS, P.S.S.C. Montpellier, FRANCE

Agenda
- AIX Resource Management Tools
- WorkLoad Manager (WLM)
- MCM Affinity Services: bind command and Resource Sets, Memory Affinity
- Impact of the number of MCMs on parallel job performance
- Versatile System Resource Allocation and Control (VSRAC): a PSSC tool integrating the AIX resource capabilities in an HPC environment

Customer resource control requirements versus AIX capabilities
- Dedicated physical resources for applications or user groups. Solution 1: partitioning. Solution 2: WorkLoad Manager (WLM) based on Resource Sets.
- Control resource allocation on a standalone system; control resource consumption (CPU, memory, I/O). Solution: WLM classes (Consumable features in LoadLeveler).
- Control the ratio of interactive to batch jobs. Solution: WLM & LoadLeveler.
- Control job priority: preemption (i.e. free resources for running high-priority jobs). Solution: WLM class tiers & shares, or the LoadLeveler Gang scheduler (2005, with backfill scheduler).
- Optimize performance for HPC applications: Memory Affinity; binding and Resource Set attachment to guarantee MCM Affinity.

Dedicated physical resources: what tools are available?
Partitioning
- Subdivides a larger machine into several smaller servers; allows multiple OS instances to peacefully coexist
- Static in AIX 5.1, dynamic in AIX 5.2
- Resources include CPU, memory and I/O adapters (granularity: 1 CPU, 256 MB of memory, 1 PCI slot)
- Drawbacks in HPC: no Memory Affinity (except for Affinity LPARs), no shared I/O and network adapters, limited SMP capabilities for multi-threaded applications

AIX Workload Manager
- Controls access to the resources of a single AIX instance; allows multiple workloads to peacefully coexist on one AIX image
- Limits CPU, memory and disk I/O bandwidth consumption
- Granularity: percentage of CPU time, physical memory and disk I/O bandwidth
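As a side note, once WLM is active, the per-class consumption that these limits govern can be watched from the command line; a minimal check (the interval and count values are arbitrary):
> wlmstat 2 5 (reports per-class CPU, memory and disk I/O usage, 5 samples at 2-second intervals)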

Elements of WLM
- Preemption priority: tiers
- Classification: classes and rules (users, groups, applications, types, tags)
- Resource control (consumption and location): limits, shares, RSETs

WLM shares and limits
Very useful to control interactive usage while ensuring batch resources (shares) and job priority.
Example on a p690 with 32 processors. Requirements (Slovakian, Tunisian, Hungarian met): guarantee 8 CPUs for interactive and 24 CPUs for batch, plus 1 ultra-priority batch class for urgent jobs.
One WLM solution: 3 WLM classes (default, batchurgent and batch) with shares and limits.

Class name    Shares
batchurgent   1000
batch         75
default       25

Hard CPU limit = 75% for the batchurgent class (leaves 8 CPUs free for interactive).
No urgent job: total shares = 100; 75 shares (24 CPUs) for batch and 25 shares (8 CPUs) for interactive.
Urgent batch job running: total shares = 1100; 1000 parts for the priority job, limited to 24 CPUs.
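As a minimal sketch of what this could look like on disk (the class names and share values come from the slide; the configuration name batchconf and the descriptions are assumptions), the shares and the hard CPU limit go into the files of a WLM configuration under /etc/wlm/batchconf:

classes file:
batchurgent:
        description = "urgent batch jobs"
batch:
        description = "normal batch jobs"

shares file:
batchurgent:
        CPU = 1000
batch:
        CPU = 75
default:
        CPU = 25

limits file:
batchurgent:
        CPU = 0%-100%;75% (the value after the semicolon is the hard maximum)

WLM is then started with this configuration, e.g. wlmcntrl -d batchconf.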

AIX Affinity Services

Processor affinity: 2 programming models for CPU binding
1) Attach a process to a specific CPU ID (with the bindprocessor command or API; root and non-root users)
2) Attach a process to a Resource Set (with the RSET APIs and commands)
Memory Affinity: AIX tries to allocate the memory on the same MCM containing the CPU.

bindprocessor process [ProcessorNum] | -q | -u Process
-q: displays the processors which are available
-u: unbinds the threads of the specified process
To bind process pid=999 to processor 1: bindprocessor 999 1
To display the processor number where the process is bound: ps -o bnd -p 999

Resource Sets
A resource set (rset) structure is a set of physical resources:
- CPUs (identified by a CPU ID created at boot time)
- memory pools (current AIX supports only one pool)
Attaching a process to an rset limits the process to using only the physical resources contained in the rset (no thread feature in AIX 5.2B). Available with partitioning.
The system rset sys/sys contains the available CPUs and the memory pool.
How to display the current rset configuration: lsrset -av
AIX 5.2: dynamic management capabilities for root and non-root users with APIs and AIX commands.
2 types of rset: partition rset and effective rset
- partition rset: restricted to the root user; only one rset per process
- effective rset: can be used by a non-root user with the CAP_NUMA_ATTACH capability:
> chuser capabilities=CAP_NUMA_ATTACH,CAP_PROPAGATE username
> or add it in the /etc/security/user file
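Putting the slide's commands together, a minimal bind/verify/unbind sequence (pid 999 is the slide's example; the CPU ID is arbitrary):
> bindprocessor -q (list the available logical processor IDs)
> bindprocessor 999 1 (bind all the threads of pid 999 to processor 1)
> ps -o bnd -p 999 (display the processor the process is bound to)
> bindprocessor -u 999 (unbind the threads of pid 999)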

AIX 5.2 Resource Set commands (http://publib16.boulder.ibm.com/pseries)
- Create an rset: mkrset -c CPUlist [-m MEMlist] rsetname (create an rset for MCM 0: mkrset -c 0-7 test/mcm0)
- Remove an rset: rmrset rsetname
- Display information about rsets: lsrset [-f] [-v | -o] [-r rsetname | -n namespace | -a]
- Attach (detach) a process to an rset: attachrset [-P] [-F] rsetname pid, or attachrset [-P] [-F] [-c CPUlist] [-m MEMlist] pid; detachrset [-P] pid
- Execute a command in an rset: execrset [-P] [-F] -c CPUlist [-m MEMlist] -e command [parameters], or execrset [-P] [-F] rsetname [-e] command [parameters]

Memory Affinity
On pSeries systems the memory is attached to the MCMs, and local memory access is faster. The target is to improve the performance of HPC applications by backing the data in memory that is attached to the MCM containing the CPU.
If memory affinity is enabled, AIX manages memory as a set of pools (one per MCM or SCM).
Pools can be monitored with the following kdb subcommands:
> kdb
> mempool *
> vmpool
> free
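Combining these commands, a sketch of confining a run to MCM 0 (assuming, as in the slide's example, that CPU IDs 0-7 belong to MCM 0; the program name and the pid are placeholders):
> mkrset -c 0-7 test/mcm0 (create an rset covering CPUs 0-7)
> lsrset -v -r test/mcm0 (verify its contents)
> execrset test/mcm0 -e ./myapp arg1 (launch a program inside the rset)
> attachrset test/mcm0 4711 (or attach an already running process)
> rmrset test/mcm0 (remove the rset when done)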

How to set Memory Affinity
In AIX 5.1: vmtune -y 1 or 0 (default) (+ bosboot -a, reboot). Global availability for all processes: on or off.
In AIX 5.2: vmo -o memory_affinity=1 or 0 (default) (+ bosboot -a, reboot), plus the environment variable MEMORY_AFFINITY, which provides memory affinity for a selected process (also available with AIX 5.1G).
Two valid settings for MEMORY_AFFINITY:
> MEMORY_AFFINITY=MCM: memory allocation is local per MCM and paging space is global
> MEMORY_AFFINITY=MCM@LRU=EARLY: both memory allocation and paging space are local per MCM

No Memory Affinity 0: vmo -o memory_affinity=0 (AIX 5.2)
Memory allocation is random by 4-kbyte page (or 16-Mbyte Large Page).
[Diagram: pages of Job 1, Job 2 and Job 3 spread randomly across the MCM memory pools]
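A minimal enabling sequence for AIX 5.2, following the slide (the -r flag makes the vmo change a reboot value; the application name is a placeholder):
> vmo -r -o memory_affinity=1 (enable memory affinity at the next boot)
> bosboot -a (rebuild the boot image)
> shutdown -Fr (reboot so the setting takes effect)
and then, per process:
> export MEMORY_AFFINITY=MCM (local allocation, global paging)
> ./myapp
or, for local allocation and local paging:
> export MEMORY_AFFINITY=MCM@LRU=EARLY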

Memory Affinity 1: vmo -o memory_affinity=1 (AIX 5.2), MEMORY_AFFINITY not set
4-kbyte pages (or 16-Mbyte Large Pages) are allocated in round-robin fashion across the MCM memory pools.

Memory Affinity 2: vmo -o memory_affinity=1 (AIX 5.2), MEMORY_AFFINITY=MCM
0) process pages are assigned locally on the pool of the MCM containing the CPU
1) if the process is rescheduled, the memory is not moved
2) new process pages are assigned locally
Default in AIX 5.1: vmtune -y 1. New in 5.1G: vmtune -y 2.

Memory Affinity 3: vmo -o memory_affinity=1 (AIX 5.2), MEMORY_AFFINITY=MCM, not enough memory in the local pool
Memory is taken from any pool.

Memory Affinity 4: vmo -o memory_affinity=1 (AIX 5.2, AIX 5.1G), MEMORY_AFFINITY=MCM@LRU=EARLY, not enough memory in the local pool (AIX 5.2B or AIX 5.1G)
Pages from the local pool are replaced (paged in/out to disk) as soon as the local pool has reached its low threshold (minfree); set maxperm to a low value.

Impact of Memory Affinity versus the number of MCMs
Elapsed time fluctuations for parallel jobs versus the number of MCMs involved: on an idle system, the AIX scheduler spreads processes or threads across the MCMs.
Differences between 1-, 2- and 4-MCM runs on a p690 32-way 1.3 GHz.
[Chart: elapsed time (sec) and percentage difference of 4-MCM versus 2-MCM and 4-MCM versus 1-MCM runs, for MPI codes (DYNAMO 8, DYNAMO 16, NTMIX 16, NSMD 16, CPMD 16, NSI 8, SDDNS 8) and Pthreads codes (U1 8, U1 16); differences range from 0% up to 155%]

VSRAC: a tool integrating the AIX resource capabilities in an HPC environment
- AIX does not provide an automatic tool to guarantee MCM Affinity: in production, due to process rescheduling, the memory bandwidth can be limited by the inter-MCM bandwidth (loss of Memory Affinity).
- The AIX affinity services (WLM, binding, rsets, Memory Affinity) are only partially used by the HPC environment (LoadLeveler, Parallel Environment).
PSSC solution: the VSRAC interface
- VSRAC allows multiple resource allocation controls in a unique interface, including the AIX resource capabilities: WLM, binding, rsets, Memory Affinity
- Standard resource allocation policies (process placement)
- Internal workload management
- Interactive commands
- Interfaces with LoadLeveler and Parallel Environment
With environment variables, users or the administrator can apply a resource allocation policy.
Available at http://tonga.mop.ibm.com/vsrac

Resource control interface (VSRAC)
- Cluster level: user jobs; interactive jobs via the vsrac commands, batch jobs via batch software (LoadLeveler, PBS) and Parallel Environment (POE, MPICH, ...)
- OS level: system resource affinity control (VSRAC) on top of the resource allocation capabilities provided by the operating system (Work Load Manager, Resource Sets, CPU binding, Memory Affinity); manual allocation of resources (WLM classes, rsets, binding), no global allocation
- Hardware level: MCM-based system; each MCM carries CPUs and memory, joined by the interconnect. The default UNIX dispatcher balances processes across the MCMs.

VSRAC resource allocation policies (POLICY_RAC)
3 types of policies:
- RSET policies: jobs are attached to rsets defined on the MCMs (MCM Affinity)
  rset_mcm: processes are placed sequentially, minimizing the number of MCMs used => reduces execution time fluctuations
  rset_mcm_r: processes are placed with a round-robin allocation scheme across the MCMs (MPI jobs: task 1 on MCM 0, task 2 on MCM 1, ...)
- WLM policies: jobs are assigned to WLM classes (not compatible with LoadLeveler ConsumableCpus)
  ll_wlm: assigns LoadLeveler classes to WLM classes
  ll_wlm_rset or ll_wlm_rset_r: ll_wlm plus task placement on the MCMs
- Bind policies: processes are bound to CPUs
  bind_pr: sequentially by CPU number
  bind_pr_r: round robin versus MCM number

VSRAC environment variables
- MCM_AFFINITY [on|off]: activates the VSRAC tools
- POLICY_RAC: sets the resource allocation policy
- JOBTYPE_RAC: specifies the job type (serial, MPI, OpenMP, ...)
- THREADS_TASK_RAC: number of threads per process for multi-threaded jobs
- WORKLOAD_RAC [on|off]: activates the VSRAC internal workload management
- TARGET_RAC: specifies a list of MCMs or CPUs to limit the job to these physical resources only
- MP_PMDSUFFIX=vsrac: POE variable to set the VSRAC interface for MPI jobs

VSRAC usage
Interactive jobs can be managed by VSRAC under the following conditions (see the sketch after this list):
- Serial, OpenMP or multi-threaded jobs must be started with the vsrac driver command:
> vsrac program_name argument1 argument2
- MPI jobs must be started with the mpp command; mpp is a wrapper script around the poe command that accepts all poe options.
LoadLeveler jobs
LoadLeveler configuration: the system administrator must add a user job prolog and epilog in the LoadL_config file:
JOB_USER_PROLOG = /opt/vsrac/bin/prolog_vsrac.sh
JOB_USER_EPILOG = /opt/vsrac/bin/epilog_vsrac.sh
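Putting the variables and the drivers together, a sketch of an interactive 16-task MPI run packed per MCM (the policy and variable names come from the slides; the task count and program name are placeholders):
> export MCM_AFFINITY=on (activate VSRAC)
> export POLICY_RAC=rset_mcm (pack the tasks on as few MCMs as possible)
> export JOBTYPE_RAC=mpi
> export MP_PMDSUFFIX=vsrac (make POE go through the VSRAC interface)
> mpp -procs 16 ./myapp (mpp wraps poe and accepts all poe options)
and for an 8-thread OpenMP job:
> export JOBTYPE_RAC=OpenMP
> export THREADS_TASK_RAC=8
> vsrac ./myapp argument1 argument2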

MCM Affinity benefits
Each application is launched several times with a policy of 1 process per CPU, on a p690 32-way 1.3 GHz.
Without VSRAC control: [Chart: elapsed time (sec) of free runs versus the free-throughput average, and the difference ratio in %, for DYNAMO 8, DYNAMO 16, NTMIX 16, NSMD 16, CPMD 16, NSI 8, SDDNS 8, U1 8 and U1 16; the differences range from about 24% up to 121%]
With VSRAC control (rset_mcm policy): [Chart: elapsed time of WLM-controlled runs versus the WLM-controlled throughput average, and the difference ratio in %; for the same codes the differences drop to between 0% and 16%]