Birmingham Status Report. HEP System Managers Meeting, 29th June 2011. Mark Slater, Birmingham University

Current Hardware

At Birmingham we have a local cluster and storage, as well as CPUs on a larger shared cluster. The specs for each are:
- 24 8-core machines (192 job slots) @ 9.61 HEP-SPEC06 (local)
- 48 4-core machines (192 job slots) @ 7.93 HEP-SPEC06 (shared)
- 177.35 TB of DPM storage across 4 pool nodes

There are 4 CEs (2 CREAM and 2 LCG) controlling the clusters, with one SE controlling the storage. All machines are SL5 and run gLite 3.2, with the exception of the LCG CEs, which are still on gLite 3.1 (should these be switched off yet?). The only significant recent change (other than in personnel!) was shifting the MySQL server onto a VM from an old machine. There is now only one machine that's out of warranty (the cfengine/DHCP head node). A rough capacity calculation based on these figures follows below.
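
As a sanity check on the numbers above, a minimal sketch (in Python) of the total HEP-SPEC06 capacity implied by these slot counts, assuming the quoted ratings are per job slot:

```python
# Rough total-capacity estimate from the slot counts and per-slot
# HEP-SPEC06 ratings quoted on this slide.
local_slots, local_hs06_per_slot = 24 * 8, 9.61     # 192 slots on the local cluster
shared_slots, shared_hs06_per_slot = 48 * 4, 7.93   # 192 slots on the shared cluster

local_total = local_slots * local_hs06_per_slot     # ~1845 HS06
shared_total = shared_slots * shared_hs06_per_slot  # ~1523 HS06

print(f"Local:  {local_total:.0f} HS06")
print(f"Shared: {shared_total:.0f} HS06")
print(f"Total:  {local_total + shared_total:.0f} HS06")
```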

Networking

(I'm afraid this will be a little lacking in detail as I'm still learning, and both Lawrie and I have been on holiday for the last two weeks!)
- Our local cluster is on a well-defined subnet (not sure of the IP range, but small!)
- The shared cluster is also on a subnet; however, this subnet also contains other parts of the cluster
- I believe we get through to JANET via the main University hub, though I'm not sure of the speeds
- I'll go into monitoring in more detail later, but essentially we use Ganglia for the majority of online network monitoring; I'm hoping to improve this in the near future (a small Ganglia query sketch follows below)
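
Since Ganglia's gmond daemon simply dumps its cluster state as XML to anyone who connects to its TCP port (8649 by default), a minimal sketch of pulling the network metrics out of it could look like the following. The monitoring host name is a made-up placeholder, not the real Birmingham node:

```python
# Minimal sketch: pull the raw cluster XML from a Ganglia gmond daemon and
# print the network metrics for each host. The host name below is an
# assumption; port 8649 is gmond's default.
import socket
import xml.etree.ElementTree as ET

def fetch_gmond_xml(host="epgmon.ph.bham.ac.uk", port=8649):
    """Connect to gmond, which dumps its XML state and then closes the socket."""
    chunks = []
    with socket.create_connection((host, port), timeout=10) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

def network_metrics(xml_bytes):
    """Yield (host, metric name, value) for the standard Ganglia network metrics."""
    root = ET.fromstring(xml_bytes)
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") in ("bytes_in", "bytes_out"):
                yield host.get("NAME"), metric.get("NAME"), metric.get("VAL")

if __name__ == "__main__":
    for host, name, val in network_metrics(fetch_gmond_xml()):
        print(f"{host:30s} {name:10s} {val} bytes/s")
```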

Schematic of the System (1)

[Site diagram: the CEs (2x LCG-CE, a CREAM CE and the local CREAM + Torque node), the shared Torque server, the DPM head node in front of the storage nodes, the software areas for the local and shared clusters, and the worker nodes: local cluster machines (full access) and shared cluster machines (no root access).]

Schematic of the System (2)

We have the majority of our service nodes running as virtual machines (Xen):

[Diagram mapping the service VMs to the host machines epgpe01, epgpe02, epgpe03, epgpe04 and epgpe10. The VMs shown are: SC CREAM, Monitoring, Alice VOBox, 2x SC LCG-CE, ATLAS Squid, Torque + CREAM (local), ARGUS, Site BDII, SC Alice VOBox, and APEL/MySQL.]

A sketch for listing the guests on one of these Xen hosts follows below.
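
For reference, a minimal sketch of how the guests on one of these Xen hosts could be listed by parsing the classic 'xm list' output. It is meant to be run on a host such as epgpe10; nothing in it is specific to the Birmingham setup, and it only reports, it changes nothing:

```python
# Minimal sketch: list the Xen guests on a VM host by parsing 'xm list'
# (the classic Xen toolstack of the SL5 era).
import subprocess

def xen_domains():
    out = subprocess.run(["xm", "list"], capture_output=True, text=True, check=True).stdout
    domains = []
    for line in out.splitlines()[1:]:               # skip the header row
        fields = line.split()
        if len(fields) >= 4 and fields[0] != "Domain-0":
            name, mem_mb, vcpus = fields[0], int(fields[2]), int(fields[3])
            domains.append((name, mem_mb, vcpus))
    return domains

if __name__ == "__main__":
    for name, mem_mb, vcpus in xen_domains():
        print(f"{name:25s} {mem_mb:6d} MB {vcpus:2d} vCPUs")
```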

Accounting

We've managed to keep the site relatively stable over the last few months, so we are currently enjoying the best accounting figures for Birmingham yet.

[Accounting plot: the failed disk cost us ~5 days]

Monitoring

We primarily use a combination of Ganglia and PBSWebMon for monitoring, and cfengine for controlling the nodes where possible. A PBSWebMon-style summary sketch follows below.
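
As a rough illustration of the kind of summary PBSWebMon gives, a small sketch that parses 'pbsnodes -a' on the Torque server and counts node states and busy slots; it assumes only that the standard Torque client tools are on the PATH:

```python
# Minimal sketch of a PBSWebMon-style summary from 'pbsnodes -a':
# count nodes per state and job slots in use.
import subprocess
from collections import Counter

def node_summary():
    out = subprocess.run(["pbsnodes", "-a"], capture_output=True, text=True, check=True).stdout
    states = Counter()
    used_slots = total_slots = 0
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("state ="):
            states[line.split("=", 1)[1].strip()] += 1
        elif line.startswith("np ="):
            total_slots += int(line.split("=", 1)[1])
        elif line.startswith("jobs =") and line.split("=", 1)[1].strip():
            # each comma-separated entry is one occupied slot
            used_slots += len(line.split("=", 1)[1].split(","))
    return states, used_slots, total_slots

if __name__ == "__main__":
    states, used, total = node_summary()
    print("node states:", dict(states))
    print(f"slots in use: {used}/{total}")
```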

Recent Problems (1)

Failed disk in May
- One of the disks on our host machines (pe10, hosting the site BDII, APEL, VOBox and MySQL) started failing
- Testing seemed to show the disk was OK, so it might have been the controller
- However, Dell replaced the disk and everything was restored from a direct backup of the whole disk taken before the replacement

SE overloading
- The SE started 'stalling' and would only stay up for 10 minutes after a restart
- This was found to be linked to excessive H1 jobs and direct transfers, and it also highlighted an issue with the MOAB config on the shared cluster

Slow ATLAS setup
- We also have an ongoing problem of multiple ATLAS user jobs on a machine being very slow to set up; not sure if this is NFS related or load related (a small diagnostic sketch follows below)
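
A rough, hypothetical diagnostic for the 'slow setup' issue: time a scan of the NFS-mounted software area and print it next to the local load average, to see which of the two the slowdown tracks. The mount point below is an assumed placeholder, not the real Birmingham path:

```python
# Rough diagnostic sketch: compare NFS software-area responsiveness with
# worker-node load. The path is a placeholder for the ATLAS software area.
import os
import time

SOFTWARE_AREA = "/nfs/atlas/sw"   # hypothetical mount point

def timed_scan(path, max_entries=5000):
    """Walk the software area and return (seconds taken, entries touched)."""
    start, count = time.time(), 0
    for dirpath, dirnames, filenames in os.walk(path):
        count += len(filenames)
        if count >= max_entries:
            break
    return time.time() - start, count

if __name__ == "__main__":
    elapsed, entries = timed_scan(SOFTWARE_AREA)
    load1, load5, load15 = os.getloadavg()
    print(f"scanned {entries} entries in {elapsed:.1f}s, "
          f"loadavg {load1:.1f}/{load5:.1f}/{load15:.1f}")
```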

Recent Problems (2)

Overloading CEs?
- Recently, I've noticed that several jobs can get stuck waiting in Torque
- This usually happens when we have a large number of jobs submitted at once, and checking Ganglia at these times reveals high load on the machine
- Would splitting the Torque and CREAM CE onto separate machines help? (a sketch for counting jobs by state follows below)

ATLAS releases
- In May we started getting frequently ticketed by ATLAS saying that we didn't have releases installed when Panda said we did
- This was due to install jobs from ATLAS only going through one CE and therefore only installing in one software area (the BDII was reporting OK)
- Solved by getting Alessandro De Salvo to make sure multiple jobs were sent, and liaising with Alden Stradlin to update the settings for Bham in Panda
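
To spot this kind of backlog without opening Ganglia, a minimal sketch that counts Torque jobs per state from the standard 'qstat' listing (Q = queued, R = running, W = waiting, H = held):

```python
# Minimal sketch: count Torque jobs per state from the default 'qstat' output,
# whose data rows look like: <jobid> <name> <user> <time> <state> <queue>.
import subprocess
from collections import Counter

def job_states():
    out = subprocess.run(["qstat"], capture_output=True, text=True, check=True).stdout
    counts = Counter()
    for line in out.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[-2] in "QRWHECT":
            counts[fields[-2]] += 1
    return counts

if __name__ == "__main__":
    for state, n in sorted(job_states().items()):
        print(f"{state}: {n}")
```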

Future Plans

Over the next few months, we plan to do the following:
- Get glexec working: everything has been installed, but local tests are failing at present
- Improve network monitoring: as Ganglia doesn't give much detail, we will try to improve this, e.g. with Cacti
- Deploy CVMFS: this should hopefully sort out our release issues and the 'slow setup' issue
- Improve job monitoring: it would be nice to analyse job stats better (e.g. wall time, CPU time, etc.); a sketch follows below
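
For the job monitoring item, one possible starting point is the Torque accounting logs, which already record wall time and CPU time per job. A hedged sketch, assuming the usual Torque log layout (the exact path and date below are placeholders):

```python
# Hedged sketch: compute per-job CPU efficiency (cput / walltime) from one
# day's Torque accounting log. Records are ';'-separated; 'E' records mark
# completed jobs and carry resources_used.cput and resources_used.walltime.
import re

LOG = "/var/spool/torque/server_priv/accounting/20110629"   # placeholder path/date

def hms_to_seconds(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

def efficiencies(path):
    for line in open(path):
        if line.count(";") < 3:
            continue
        stamp, rectype, jobid, attrs = line.rstrip("\n").split(";", 3)
        if rectype != "E":                       # only end-of-job records
            continue
        cput = re.search(r"resources_used\.cput=(\d+:\d+:\d+)", attrs)
        wall = re.search(r"resources_used\.walltime=(\d+:\d+:\d+)", attrs)
        if cput and wall and hms_to_seconds(wall.group(1)) > 0:
            yield jobid, hms_to_seconds(cput.group(1)) / hms_to_seconds(wall.group(1))

if __name__ == "__main__":
    for jobid, eff in efficiencies(LOG):
        print(f"{jobid:20s} cpu/wall = {eff:.2f}")
```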