A New Quality of Service (QoS) Policy for Lustre Utilizing the Lustre Network Request Scheduler (NRS) Framework



Similar documents
Lustre Monitoring with OpenTSDB

Lustre tools for ldiskfs investigation and lightweight I/O statistics

Improving Quality of Service

The network we see so far. Internet Best Effort Service. Is best-effort good enough? An Audio Example. Network Support for Playback

Cray Lustre File System Monitoring

This topic lists the key mechanisms use to implement QoS in an IP network.

Quality of Service versus Fairness. Inelastic Applications. QoS Analogy: Surface Mail. How to Provide QoS?

Quality of Service. Traditional Nonconverged Network. Traditional data traffic characteristics:

Sources: Chapter 6 from. Computer Networking: A Top-Down Approach Featuring the Internet, by Kurose and Ross

Configuring QoS in a Wireless Environment

Distributed Systems 3. Network Quality of Service (QoS)

Uncovering degraded application performance with LWM 2. Aamer Shah, Chih-Song Kuo, Lucas Theisen, Felix Wolf November 17, 2014

Application Performance for High Performance Computing Environments

Real-time apps and Quality of Service

Quality of Service Analysis of site to site for IPSec VPNs for realtime multimedia traffic.

TECHNICAL NOTE. FortiGate Traffic Shaping Version

16-PORT POWER OVER ETHERNET WEB SMART SWITCH

Commoditisation of the High-End Research Storage Market with the Dell MD3460 & Intel Enterprise Edition Lustre

Motivation. QoS Guarantees. Internet service classes. Certain applications require minimum level of network performance:

RESOURCE ALLOCATION FOR INTERACTIVE TRAFFIC CLASS OVER GPRS

COMPARATIVE ANALYSIS OF DIFFERENT QUEUING MECHANISMS IN HETROGENEOUS NETWORKS

AFDX networks. Computers and Real-Time Group, University of Cantabria

Lustre Networking BY PETER J. BRAAM

Quality of Service (QoS): Managing Bandwidth More Effectively on the Series 2600/2600-PWR and Series 2800 Switches

Quality of Service in the Internet. QoS Parameters. Keeping the QoS. Traffic Shaping: Leaky Bucket Algorithm

CS640: Introduction to Computer Networks. Why a New Service Model? Utility curve Elastic traffic. Aditya Akella. Lecture 20 QoS

POWER ALL GLOBAL FILE SYSTEM (PGFS)

GE Measurement & Control. Cyber Security for NEI 08-09

Application Performance Analysis and Troubleshooting

6.6 Scheduling and Policing Mechanisms

Quality of Service (QoS)) in IP networks

A Shared File System on SAS Grid Manger in a Cloud Environment

QoS in PAN-OS. Tech Note PAN-OS 4.1. Revision A 2011, Palo Alto Networks, Inc.

QoS for (Web) Applications Velocity EU 2011

The Moab Scheduler. Dan Mazur, McGill HPC Aug 23, 2013

Table of Contents. Cisco How Does Load Balancing Work?

The State of OpenFlow: Advice for Those Considering SDN. Steve Wallace Executive Director, InCNTRE SDN Lab Indiana University

1. The subnet must prevent additional packets from entering the congested region until those already present can be processed.

Multimedia Requirements. Multimedia and Networks. Quality of Service

vsphere Networking vsphere 6.0 ESXi 6.0 vcenter Server 6.0 EN

Network Function Virtualization

Optimizing Converged Cisco Networks (ONT)

Iperf Tutorial. Jon Dugan Summer JointTechs 2010, Columbus, OH

QoS Queuing on Cisco Nexus 1000V Class-Based Weighted Fair Queuing for Virtualized Data Centers and Cloud Environments

The Google File System

Internet Quality of Service

Fireware XTM Traffic Management

Smart Queue Scheduling for QoS Spring 2001 Final Report

The IP Transmission Process. V1.4: Geoff Bennett

isco Troubleshooting Input Queue Drops and Output Queue D

A Shared File System on SAS Grid Manager * in a Cloud Environment

Worst Case Analysis - Network Calculus

Frame Metering in 802.1Q Version 01

QoS Parameters. Quality of Service in the Internet. Traffic Shaping: Congestion Control. Keeping the QoS

Fuzzy Active Queue Management for Assured Forwarding Traffic in Differentiated Services Network

Virtual Server in SP883

QoS: Color-Aware Policer

StreamServe Persuasion SP5 Microsoft SQL Server

ENTERPRISE INFRASTRUCTURE CONFIGURATION GUIDE

A High Performance Computing Scheduling and Resource Management Primer

1. Simulation of load balancing in a cloud computing environment using OMNET

IP videoconferencing solution with ProCurve switches and Tandberg terminals

Three Key Design Considerations of IP Video Surveillance Systems

The Network Layer Functions: Congestion Control

Lecture 16: Quality of Service. CSE 123: Computer Networks Stefan Savage

Extensible Network Configuration and Communication Framework

An Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing

Process Scheduling CS 241. February 24, Copyright University of Illinois CS 241 Staff

Windows Server Performance Monitoring

Firewall. Vyatta System. REFERENCE GUIDE IPv4 Firewall IPv6 Firewall Zone Based Firewall VYATTA, INC.

AN APPLICATION OF INFORMATION RETRIEVAL IN P2P NETWORKS USING SOCKETS AND METADATA

Quality of Service (QoS) on Netgear switches

IBM. Tivoli. Netcool Performance Manager. Cisco Class-Based QoS Technology Pack. User Guide. Document Revision R2E1

STANDPOINT FOR QUALITY-OF-SERVICE MEASUREMENT

Lustre performance monitoring and trouble shooting

Lustre * Filesystem for Cloud and Hadoop *

EView/400i Management Pack for Systems Center Operations Manager (SCOM)

Cisco Integrated Services Routers Performance Overview

Quality of Service (QoS) for Enterprise Networks. Learn How to Configure QoS on Cisco Routers. Share:

Configuring QoS in a Wireless Environment

Apache Hadoop. Alexandru Costan

Computer Networks I Laboratory Exercise 1

QUEUES. Primitive Queue operations. enqueue (q, x): inserts item x at the rear of the queue q

Technical Bulletin. Arista LANZ Overview. Overview

ESM s management across multi-platforms eliminates the need for various account managers.

Traffic Mirroring Commands on the Cisco IOS XR Software

Small is Better: Avoiding Latency Traps in Virtualized DataCenters

Edgewater Routers User Guide

Current Status of FEFS for the K computer

FortiOS Handbook - Traffic Shaping VERSION 5.2.0

Traffic Control in a Linux, Multiple Service Edge Device

An Oracle Technical White Paper November Oracle Solaris 11 Network Virtualization and Network Resource Management

Transcription:

2013/09/17 A New Quality of Service (QoS) Policy for Lustre Utilizing the Lustre Network Request Scheduler (NRS) Framework Shuichi Ihara DataDirect Networks Japan

Background: Why QoS? Lustre throughput and metadata performance scales very well with the number of OSTs/OSSs and with DNE the number of MDTs/MDSs Lustre performance is well balanced But, as of today, Lustre does not offer the option to manage performance or to limit performance Lustre is increasingly entering application areas outside the mainstream HPC with its large parallel application ( file-perprocess ) use cases Some of these use cases require system administrators to manage (limit, increase, prioritize) performance Eventually, approaches to deal with these use cases will also benefit mainstream HPC centers! 2

Quality of Service (QoS) Quality of Service (QoS) is a mechanism to ensure a "guaranteed performance QoS was developed mostly in the network world, and especially, on TCP/IP networks, which pose specific QoS challenges QoS features are available on many network hardware or network management software products QoS is somewhat less common in the storage world, although some (expensive) enterprise storage products claim QoS or QoS-like features 3

LQS: A QoS Policy for Lustre We have developed a policy layer called Lustre QoS (LQS) that can provide QoS by controlling the number of RPCs handled on the Lustre servers LQS runs as a policy of the Network Request Scheduler (NRS) Eventually, LQS limits Lustre performance by limiting the number of bandwidth and/or metadata operations 4

The Network Request Scheduler (NRS) A new component of the PTLRPC service NRS works on the server side and allows handling of incoming RPCs before passing them to the OSS/MDS threads for the backend file system This framework, together with a few policy options, was merged into the Lustre mainstream and has been available since Lustre 2.4.0 The NRS Framework is very flexible and it is fairly easy and straightforward to add new policies Policies can manage RPCs based on NID and UID/GID/ JOBID, etc.. 5

QoS Algorithms Many types of QoS algorithms have been developed over the past few decades The Token Bucket Filter (TBF) is a major algorithm used in general network systems It's simple and easy to implement Many Ethernet switches and routers use TBF to enable QoS features TBF can accommodate very small burst traffic, but is also OK for long-term data transmission 6

The Token Bucket Filter (TBF) Tokens replenished at rate R(t) Token rate R(t) Bucket depth B When input data arrives, but if no token available, it waits until enough tokens are ready. In some case, very small burst traffic B 7 INPUT N input data per N tokens OUTPUT Transfer Rate O(t) O(t) R(t) + B

TBF Implementation for Lustre Enqueue based on ID FIFO queues Token buckets Dequeue based on deadlines Incoming RPC Tokens Handling RPC 8

TBF patches for Lustre LU-3558 ptlrpc: Add the NRS TBF policy Main TBF code for NRS-based policy LU-3495 ptlrpc: Add rate counter for request handling New counters in /proc to show request handing LU-3494 libcfs: Add relocation function to libcfs heap Added a function to efficiently change the rank of queue 9

UseCase #1 Lustre QoS based on NIDs (Clients) Submitted jobs UserA Job-X High priority performance Lustre clients LQS High priority Clients...... Job-Y LQS Low priority Clients High bandwidth Total available bandwidth Limited (Lower) bandwidth... 10 Lustre Server MDS/OSS

UseCase #2 Lustre QoS based on JOBID UserA+JOB-X is high priority of Lustre performance. UserA Job-X High bandwidth Submitted jobs Low bandwidth Job-Y Clients... UserB Maximum Lustre bandwidth... 11 Lustre Server MDS/OSS

How to use Lustre QoS u Change NRS policy to TBF with NID # lctl set_param ost.oss.ost_io.nrs_policies="<nrs policy> <TBF argument>" # lctl set_param ost.oss.ost_io.nrs_policies="tbf nid" u Set rule with classification and number of token rate # lctl set_param ost.oss.ost_io.nrs_tbf_rule="start <TBF's rule name> {NID} <rate>" # lctl set_param ost.oss.ost_io.nrs_tbf_rule="start rule_client1 {192.168.1.1@o2ib} 1" # lctl set_param ost.oss.ost_io.nrs_tbf_rule="start rule_clients {192.168.1.[2-16]@o2ib} 10" u Change number of token rate # lctl set_param ost.oss.ost_io.nrs_tbf_rule="change <TBF's rule name> <new rate>" # lctl set_param ost.oss.ost_io.nrs_tbf_rule="change rule_client1 100" u Stop a rule (delete) # lctl set_param ost.oss.ost_io.nrs_tbf_rule="stop <TBF's rule name>" # lctl set_param ost.oss.ost_io.nrs_tbf_rule="stop rule_client1" 12

Lustre QoS for Clients Test results "dd" command to single OST from multiple clients with various QoS rules (Write, 1MB IO, max_rpc_in_flright=32) 13

QoS for UID with Jobstats Test result Start JOBstats and changed NRS policy to TBF with JOBID # lctl set_param jobid_var=procname_uid # lctl set_param ost.oss.ost_io.nrs_policies="tbf jobid" Set rule with classification and number of token # lctl set_param ost.oss.ost_io.nrs_tbf_rule="start <TBF's rule name> {JOBID} <rate>" # lctl set_param ost.oss.ost_io.nrs_tbf_rule="start iozone_user1 {iozone.500} 1" Change number of token # lctl set_param ost.oss.ost_io.nrs_tbf_rule="change iozone_user1 X" (change X to 10,50 and 100) Stop a rule (delete) # lctl set_param ost.oss.ost_io.nrs_tbf_rule="stop iozone_user1" user1's(uid=500) iozone(1m, Write) 700 600 500 stop default #token MB/sec 400 300 200 100 start token=1 token=10 token=50 token=100 0 14 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170

Conclusion We adapted a standard Token Bucket Filter (TBF) algorithm to Lustre and implemented LQS, a QoS policy based on the Lustre Network Request Scheduler (NRS) framework We demonstrated that it is possible to manage Lustre performance selectively and with high accuracy, discriminating by NID or JOBID. (We are still more testing!) As of today, this approach only supports simple QoS rules with only a single discriminator present at any time A rule with multiple discriminators appears possible, but is still under investigation Changes in the NRS framework itself may be necessary to implement more complex policies with multiple discriminators 15

Thank you! 16