CERN local High Availability solutions and experiences. Thorsten Kleinwort CERN IT/FIO WLCG Tier 2 workshop CERN 16.06.2006





Introduction
- Different hardware used for GRID services
- Various techniques
- First experiences and recommendations

GRID Service issues
- Most GRID services have no built-in failover functionality
- Stateful servers: information is lost on a crash
- Domino effects on failures
- Scalability issues: e.g. more than one CE

Different approaches
Server hardware setup:
- Cheap off-the-shelf hardware is adequate only for batch nodes (WN)
- Improve reliability (at increased cost):
  - Dual power supplies (and test that they work!)
  - UPS
  - RAID on system/data disks (hot swappable)
  - Remote console/BIOS access
  - Fibre Channel disk infrastructure
- CERN: Mid-Range (Disk) Servers


Various Techniques
Where possible, increase the number of servers to improve the service:
- WN, UI: trivial
- BDII: possible; the state information is very volatile and is re-queried periodically
- CE, RB: more difficult, but possible for scalability

Load Balancing
For multiple servers, use a DNS alias / load balancing to select the right one:
- Simple DNS aliasing:
  - For selecting the master or slave server
  - Useful for scheduled interventions
  - Dedicated (VO-specific) host names pointing to the same machine (e.g. RB)
- Pure DNS round robin (no server check), configurable in DNS: to balance load equally
- Load Balancing Service, based on server load/availability/checks:
  - Uses monitoring information to determine the best machine(s)
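The monitoring-driven variant above can be sketched as follows. This is a minimal illustration of the selection step only; the host names, metric layout, and DNS publish step are assumptions, not the actual SNMP/Lemon interface used at CERN.

```python
# Sketch of a monitoring-driven DNS load balancer: pick the least-loaded
# healthy hosts, which would then be published behind the DNS alias.
# Metric format and host names are illustrative only.

def pick_best_hosts(metrics, n=2):
    """Return the n least-loaded healthy hosts.

    metrics: dict mapping host -> {"load": float, "healthy": bool}
    """
    healthy = [h for h, m in metrics.items() if m["healthy"]]
    return sorted(healthy, key=lambda h: metrics[h]["load"])[:n]

# Example: three interactive nodes behind an alias such as LXPLUS
metrics = {
    "lxplus101": {"load": 0.7, "healthy": True},
    "lxplus102": {"load": 0.2, "healthy": True},
    "lxplus103": {"load": 0.1, "healthy": False},  # failed health check
}
best = pick_best_hosts(metrics)
# The selected hosts would then be pushed into DNS behind the alias.
```

Note that an unhealthy node is excluded even if its load looks low, which is what distinguishes this from pure DNS round robin.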

DNS Load Balancing
[Diagram: the LB server queries node load via SNMP/Lemon, selects the least-loaded LXPLUSXXX node, and updates the LXPLUS alias (LXPLUS.CERN.CH) in DNS]

Load balancing & scaling
Publish several (distinguishable) servers for the same site to distribute load:
- Can also be done for stateful services, because clients stick to the selected server
- Helps to reduce high load (CE, RB)

Use existing Services
- All GRID applications that support ORACLE as a data backend (will) use the existing ORACLE service at CERN for state information
- This allows applications to be stateless and therefore load balanced
- The ORACLE solution is based on RAC servers with FC-attached disks

High Availability Linux
- Add-on to standard Linux
- Switches the IP address between two servers:
  - If one server crashes
  - On request
- The machines monitor each other
- To further increase availability, state information can be kept on (shared) FC disks
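The failover decision behind this setup can be sketched in a few lines. This is a simplified model of the heartbeat/takeover logic only, not the real HA Linux (Heartbeat) implementation; the timeout value and node names are illustrative.

```python
import time

# Simplified sketch of heartbeat-based failover: each node tracks when it
# last heard from its peer and takes over the service IP once the peer
# has been silent too long. The threshold is an illustrative assumption.

HEARTBEAT_TIMEOUT = 10.0  # seconds of peer silence before takeover

class HANode:
    def __init__(self, name, has_service_ip=False):
        self.name = name
        self.has_service_ip = has_service_ip
        self.peer_last_seen = time.time()

    def on_peer_heartbeat(self, now=None):
        """Record that the peer is alive."""
        self.peer_last_seen = now if now is not None else time.time()

    def check_failover(self, now=None):
        """Take over the service IP if the peer appears dead."""
        now = now if now is not None else time.time()
        if not self.has_service_ip and now - self.peer_last_seen > HEARTBEAT_TIMEOUT:
            # A real setup would bring up the shared IP alias here and,
            # if state lives on shared FC disks, mount them as well.
            self.has_service_ip = True
        return self.has_service_ip
```

In the real system both nodes run this logic symmetrically, which is why split-brain protection (e.g. a second heartbeat path) matters.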

HA Linux + FC disk
[Diagram: two servers behind SERVICE.CERN.CH, linked by a heartbeat and attached via redundant FC switches to shared FC disks; no single point of failure]

WMS: CE & RB
- CE & RB are stateful servers, so load balancing is used only for scaling:
  - Done for RB and CE
- CEs and RBs run on Mid-Range-Server hardware
- If one server goes, the information it keeps is lost
- [Possibly an FC storage backend behind load-balanced, stateless servers]

GRID Services: DMS & IS
DMS (Data Management System):
- SE: Storage Element
- FTS: File Transfer Service
- LFC: LCG File Catalogue
IS (Information System):
- BDII: Berkeley Database Information Index

DMS: SE
- castorgrid: load-balanced front-end cluster with a CASTOR storage backend
- 8 simple machines (batch type)
- Here load balancing and failover work (but not for plain GRIDFTP)

DMS: FTS & LFC
LFC & FTS service:
- Load-balanced front end
- ORACLE database backend
Additionally for the FTS service:
- VO agent daemons
  - One warm spare for all VOs: gets hot when needed
- Channel agent daemons (master & slave)

FTS: VO agent (failover)
[Diagram: one VO agent daemon per experiment (ALICE, ATLAS, CMS, LHCB) plus a single shared SPARE that takes over when an agent fails]


IS: BDII
BDII: load-balanced Mid-Range-Servers:
- LCG-BDII: top-level BDII
- PROD-BDII: site BDII
- In preparation: <VO>-BDII
State information lives in a volatile cache that is re-queried within minutes
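The volatile cache is what makes BDII easy to load balance: a lost node's state is simply rebuilt on the next query cycle. A minimal sketch of such TTL-based re-querying, with the TTL value and fetch function as illustrative assumptions:

```python
import time

# Sketch of a volatile, periodically re-queried cache in the style of
# BDII's state handling: stale data is refreshed from the information
# providers, so nothing is lost permanently if a node dies. The TTL is
# an illustrative assumption.

class VolatileCache:
    def __init__(self, fetch, ttl=120.0):
        self.fetch = fetch  # re-queries the information providers
        self.ttl = ttl      # seconds before cached data is stale
        self._data = None
        self._fetched_at = None

    def get(self, now=None):
        now = now if now is not None else time.time()
        if self._data is None or now - self._fetched_at > self.ttl:
            self._data = self.fetch()  # re-query instead of recovering state
            self._fetched_at = now
        return self._data
```

Because every answer can be regenerated by `fetch`, any number of such servers can sit behind one DNS alias without state synchronisation.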


Example: BDII
- 17.11.2005: first BDII in production
- 29.11.2005: second BDII in production

AAS: MyProxy
- MyProxy has a replication function for a slave server that allows read-only proxy retrieval [not used any more]
- Instead, the slave regularly gets a read-write copy from the master (rsync, every minute) to allow a DNS switch-over
- HA Linux handles the failover
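The per-minute replication step can be sketched as the rsync invocation a cron job on the master would run. The repository path and standby host name are illustrative assumptions, not the actual CERN configuration.

```python
# Sketch of the master-to-slave replication described above: build the
# rsync command that mirrors the MyProxy repository to the standby.
# Path and host name are illustrative only.

def build_rsync_cmd(src="/var/myproxy/", standby="px102"):
    return [
        "rsync",
        "-a",        # archive mode: preserve permissions, times, owners
        "--delete",  # keep the standby an exact mirror of the master
        src,
        f"{standby}:{src}",
    ]

# A cron-driven wrapper would run this once per minute, e.g. via
# subprocess.run(build_rsync_cmd(), check=True)
```

Mirroring with `--delete` matters here: after a DNS switch-over the standby must serve exactly the master's proxy repository, including removals.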

HA Linux
[Diagram: PX101 and PX102 linked by heartbeat and per-minute rsync; MYPROXY.CERN.CH points to the active node]

AAS: VOMS
[Diagram: LCG-VOMS alias managed by HA Linux (voms-ping health check) over nodes voms102 and voms103; VOMS DB on an ORACLE 10g RAC]

MS: SFT, GRVW, MONB
- Not seen as critical
- Nevertheless, the servers have been migrated to Mid-Range-Servers
- GRVW has a hot standby
- SFT & MONB run on Mid-Range-Servers