PeerMon: A Peer-to-Peer Network Monitoring System




PeerMon: A Peer-to-Peer Network Monitoring System Tia Newhall, Janis Libeks, Ross Greenwood, Jeff Knerr Computer Science Department Swarthmore College Swarthmore, PA USA newhall@cs.swarthmore.edu

Target: General Purpose NWs
- Usually single-LAN systems
- Each machine's resources controlled by its local OS
- NFS, but little other system-wide resource sharing
- No central scheduler of NW-wide resources
- Users tend to statically pick node(s) to use
  (ex) write MPI hostfile once, use it every time
- Users may not have a choice
  (ex) ssh cs.swarthmore.edu: target is chosen from a static set
- Often large imbalances in NW-wide resource usage

Imbalances Cause Poor Performance
- Swapping on some nodes while lots of free RAM on others
- Large variations in CPU loads
- Variations in contention for NIC, disk, other devices
- Parallel applications (ex. MPI):
  - Performance usually determined by the slowest node
  - Picking one overloaded node can result in a big performance hit
- Sequential applications:
  - Low response rate for interactive jobs
  - Longer execution times for batch jobs

Want to do better load balancing
- Tool to easily and quickly discover good nodes
  (low CPU load, enough free RAM, fewest number of processes, total # CPUs, ...)
- Use it to make better job/process placement, get better load balancing, and avoid problems with load imbalances
- But it has to fit the constraints of the target system:
  - Still a general-purpose system where each OS manages its local node's resources
  - Not implementing a global resource scheduler

PeerMon: P2P Resource Monitoring System
- Scalable, fault-tolerant, low-overhead system
- No central authority, so no single bottleneck and no single point of failure
- Each node runs an equal peer that provides system-wide resource usage data to local users on its node
- Fast local access to system-wide resource usage data
- Layered architecture:
  - PeerMon does the system-wide data collection
  - Higher-level services use PeerMon data to do load balancing, job placement, ...

PeerMon Architecture
- Every node runs an equal peer that collects system-wide resource usage data
- Sender and Listener threads: communicate over the P2P NW
- Client Interface thread: exports PeerMon data to higher-level services that use it (they communicate with the local peermon daemon only!)

Listener and Sender Threads
- Listener thread: receives resource usage data from other peers; updates its system-wide resource usage data (stored in a hashmap)
- Sender thread: periodically wakes up and sends its data about the whole system to 3 peers
- Both use UDP/IP:
  - Fast; don't need reliable delivery
  - Single UDP socket vs. one per connection with TCP
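The two-thread design above can be sketched roughly as follows. This is an illustrative reconstruction, not PeerMon's actual source: the port number, wire format (JSON), and function names are all assumptions. Both threads share a single UDP socket, and the listener keeps whichever entry for a host carries the newer timestamp.

```python
import json
import socket
import threading
import time

SEND_PERIOD = 3.0    # seconds between sender wake-ups (assumed value)
PEERMON_PORT = 1981  # assumed port, for illustration only

node_data = {}                  # hostname -> (timestamp, usage info)
data_lock = threading.Lock()

def merge(local, incoming):
    """Listener-thread logic: keep the newer entry for each host."""
    for host, (ts, info) in incoming.items():
        if host not in local or ts > local[host][0]:
            local[host] = (ts, info)

def listener(sock):
    """Receive another peer's whole view and merge it into ours."""
    while True:
        msg, _addr = sock.recvfrom(65535)
        with data_lock:
            merge(node_data, json.loads(msg.decode()))

def sender(sock, choose_three_peers):
    """Periodically send our whole hashmap to 3 chosen peers."""
    while True:
        time.sleep(SEND_PERIOD)
        with data_lock:
            payload = json.dumps(node_data).encode()
        for host in choose_three_peers():
            sock.sendto(payload, (host, PEERMON_PORT))  # UDP: no ack needed

def start_peer(choose_three_peers):
    """Create the single shared UDP socket and start both threads."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", PEERMON_PORT))
    threading.Thread(target=listener, args=(sock,), daemon=True).start()
    threading.Thread(target=sender, args=(sock, choose_three_peers),
                     daemon=True).start()
```

Because delivery is unreliable, a dropped datagram simply means one peer's view is briefly staler; the next gossip round repairs it.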

Resource Usage Data
- Each PeerMon peer:
  - Collects info about its own node
  - Sends its full hashmap data to 3 peers
    - Cycles through different heuristics to choose the 3, to ensure full connectivity and that new nodes get quickly integrated
  - Receives info about other nodes from some of its peers
- Constraints on a PeerMon peer's data:
  - Doesn't need to be consistent across peers
  - With good messaging heuristics it is close to consistent
  - If a higher-level service requires an absolute authority, it can choose 1 PeerMon node to be that authority (no different from centralized SNMP systems)
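One plausible way to cycle through peer-selection heuristics is sketched below. The specific heuristics (stalest-first, freshest-first, random) are assumptions used for illustration; the slide only says PeerMon alternates heuristics to ensure full connectivity and fast integration of new nodes.

```python
import random

def choose_three(node_data, step):
    """Pick 3 target peers, rotating through selection heuristics.

    node_data maps hostname -> (timestamp, usage info); step is a
    counter incremented on each sender wake-up.
    """
    hosts = list(node_data)
    by_age = sorted(hosts, key=lambda h: node_data[h][0])  # oldest first
    if step % 3 == 0:        # stalest: refresh nodes we've heard least from
        return by_age[:3]
    elif step % 3 == 1:      # freshest: well-connected nodes spread data fast
        return by_age[-3:]
    else:                    # random: helps guarantee eventual full connectivity
        return random.sample(hosts, min(3, len(hosts)))
```

Alternating between targeted and random picks is what keeps every peer's view close to consistent without any coordination.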

Why send to 3 peers?
[Graphs: average data age and NW bandwidth vs. number of peers sent to, for a 500-node system.]
- Sending to 3 peers is a good trade-off between data age and NW overheads

Client Thread
- The local PeerMon daemon provides all system-wide data to local users (currently a TCP interface)
- If a higher-level service requires an absolute authority, it can interact with exactly one PeerMon daemon, or implement distributed consensus with more than one
- Services that don't need absolute agreement interact with their local PeerMon daemon => purely distributed interaction
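A minimal client of that TCP interface might look like the sketch below. The port and wire format (a one-line request, a JSON reply) are invented for illustration; the actual PeerMon client protocol is not described on the slide. The essential property is that a client only ever connects to the daemon on its own machine.

```python
import json
import socket

PEERMON_CLIENT_PORT = 1982  # assumed client-interface port

def query_local_peermon(port=PEERMON_CLIENT_PORT):
    """Fetch the system-wide resource table from the local daemon."""
    with socket.create_connection(("localhost", port)) as s:
        s.sendall(b"GETALL\n")            # made-up request keyword
        chunks = []
        while (buf := s.recv(4096)):      # read until the daemon closes
            chunks.append(buf)
        return json.loads(b"".join(chunks).decode())
```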

System start-up
- A new peermon process gets 3 peer IPs from a config file
- The sender thread sends data to those 3 peers to connect to the P2P NW
- If at least 1 of the 3 eventually runs peermon, the new node will enter the PeerMon P2P NW

Fault Tolerance and Recovery
- When a node fails or becomes unreachable, its data ages out of the system
  - Users of PeerMon data at other nodes will not choose the failed node as one of the good nodes
- Recovery: no different from start-up
  - No global state needs to be reconstructed; the new peermon daemon will enter the P2P NW and begin receiving system-wide resource usage data
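Aging out can be as simple as filtering on entry timestamps, as in the sketch below. The staleness threshold is an assumption, not a PeerMon constant; the point is that a dead node stops being recommended without any global failure-detection protocol.

```python
import time

MAX_AGE = 30.0  # seconds; assumed staleness threshold

def live_nodes(node_data, now=None):
    """Return only hosts whose data is fresh enough to trust.

    node_data maps hostname -> (timestamp, usage info); entries that
    have not been refreshed within MAX_AGE are silently dropped from
    any ranking, so a failed node "ages out" with no explicit cleanup.
    """
    now = time.time() if now is None else now
    return {h: info for h, (ts, info) in node_data.items()
            if now - ts <= MAX_AGE}
```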

Example Uses of PeerMon
- SmarterSSH: uses PeerMon data to pick the best ssh target
- autompigen: generates an MPI hostfile, choosing the best nodes based on PeerMon data
- Dynamic DNS mapping: dynamically binds a name to one of the current set of best nodes
  - Uses RR in BIND 9 to rotate through the set of top N machines, periodically updated by PeerMon

SmarterSSH and autompigen
- Simple Python programs; use the PeerMon client TCP interface
- Can order the best nodes based on CPU load, amount of free RAM, or a combination of both
- Use a delta value when ordering nodes, so small differences in load are not significant to the ordering
- smarterssh randomizes the order of equally good nodes, so quick subsequent invocations distribute ssh load over the set of best nodes
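The delta-based ordering with tie randomization could be implemented along these lines. The scoring function and default delta are illustrative assumptions; only the two ideas on the slide are modeled: scores within delta of each other count as ties, and ties are shuffled so back-to-back invocations spread load.

```python
import random

def rank_nodes(scores, delta=0.1):
    """Order hosts best-first (lower score = better), shuffling
    groups of hosts whose scores tie within delta."""
    ordered = sorted(scores, key=scores.get)
    result, group = [], []
    for host in ordered:
        # If this host is more than delta worse than the group's best,
        # the current tie group is finished: shuffle and flush it.
        if group and scores[host] - scores[group[0]] > delta:
            random.shuffle(group)
            result.extend(group)
            group = []
        group.append(host)
    random.shuffle(group)
    result.extend(group)
    return result
```

Without the shuffle, two users running smarterssh a second apart would both land on the same top-ranked machine before its load data had a chance to update.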

Example smarterssh commands

How much does PeerMon help?
- Three benchmark programs:
  1. Memory-intensive sequential program
  2. CPU-intensive OpenMP program (single node)
  3. RAM- and CPU-intensive parallel MPI program (run on 8 of 50 nodes)
- Experiments comparing:
  - Runs on randomly selected node(s): no PeerMon
  - Nodes chosen using PeerMon data, ordered by:
    - CPU only
    - available RAM only
    - both CPU load and available RAM

Speed-up of PeerMon vs. Random Node Ranking

               Sequential        OpenMP            8-node MPI
               (RAM intensive)   (CPU intensive)   (both)
  CPU only     0.87              1.63              1.27
  RAM only     4.62              2.19              1.78
  CPU & RAM    4.62              2.29              1.83

+ Using PeerMon significantly improves performance; random does better only when the PeerMon ordering criterion is a bad match for the application
+ The combination of CPU & RAM is the best ordering criterion

Scalability of PeerMon
- Tested PeerMon NWs of 2 to 2,200 nodes
- Collected traces of MRTG data for CPU, RAM, and NW bandwidth
- Results:
  - Per-node CPU and RAM usage remain constant
  - Per-node NW bandwidth grows slightly with the size of the P2P NW, but is still very small: up to 0.16 Mbit/s for a 2,200-node system
  - Each node sends information about every node in the NW, so as the PeerMon NW grows, so does the amount of data sent

Conclusions
- PeerMon: a P2P, low-overhead, scalable, fault-tolerant resource monitoring system for general-purpose LANs
- It provides system-wide resource usage data and an interface to export the data to higher-level tools and services
- Our example tools that use PeerMon data provide some load balancing in general-purpose NW systems and result in significant improvements in performance

Future Work
- Release a beta version under the GPL, we hope before the end of summer: www.cs.swarthmore.edu/~newhall/peermon
- Further investigate security and scalability issues
- PeerMon that spans multiple LANs?
- Implement an easier-to-use client interface
- Add an extensibility interface to change the set of system resources monitored and how they are monitored
- Implement more tools that use PeerMon