Techniques for implementing & running robust and reliable DB-centric Grid Applications



International Symposium on Grid Computing 2008, 11 April 2008
Miguel Anjo, CERN - Physics Databases

Outline
- Robust and DB-centric applications
- Technologies behind: Oracle Real Application Clusters, Oracle Streams, Oracle Data Guard, DNS load balancing
- Implementing robust applications: what to expect, what to do, other planning

Robust & DB-centric Applications
- Robust (adj.): vigorous, powerfully built, strong
- Resilient (adj.): able to recover from or adjust easily to misfortune or change
- DB-centric application: one whose essential data is stored in a database

Technologies behind
- Oracle Real Application Clusters
- Oracle Data Guard
- DNS Load Balancing

Oracle RAC architecture
Oracle Real Application Clusters 10g: Foundation for Grid Computing
http://www.oracle.com/technology/products/database/clustering/index.html

Oracle RAC Architecture
- Applications consolidated on large clusters
- Redundant and homogeneous HW across each RAC

Architecture
(diagram slide)

Architecture (Oracle services)
- Resources are distributed among Oracle services
- Each application is assigned to a dedicated service
- Application components might have different services
- Service reallocation is not always completely transparent

CMS RAC node #  1          2          3          4          5          6          7          8
CMS_COND        Preferred  A1         A2         A3         A4         A5         A6         A7
ATLAS_DQ2       Preferred  A2         A3         A4         A5         A6         A7         A1
LCG_SAM         A5         A3         A1         A2         Preferred  Preferred  Preferred  A4
LCG_FTS         A4         A5         A6         A7         Preferred  A1         A2         A3
CMS_SSTRACKER   Preferred  Preferred  Preferred  Preferred  Preferred  Preferred  Preferred  Preferred
CMS_PHEDEX      A2         Preferred  Preferred  Preferred  A1         A3         A4         A5

The same services mapped over 7 nodes:

CMS RAC node #  1          2          3          4          5          6          7
CMS_COND        Preferred  A1         A2         A3         A4         A5         A6
ATLAS_DQ2       Preferred  A2         A3         A4         A5         A6         A1
LCG_SAM         A4         A2         Preferred  A1         Preferred  Preferred  A3
LCG_FTS         A3         A4         A5         A6         Preferred  A1         A2
CMS_SSTRACKER   Preferred  Preferred  Preferred  Preferred  Preferred  Preferred  Preferred
CMS_PHEDEX      A1         Preferred  Preferred  Preferred  A2         A3         A4

Architecture (Virtual IP)
- The service's connection string lists all the virtual IPs
- The client connects to a random virtual IP (client-side load balancing)
- The listener hands the connection to the least loaded node where the service runs (server-side load balancing)

$ sqlplus db_user@lcg_fts
(diagram: virtual IPs srv1-v, srv2-v, srv3-v, srv4-v in front of listeners on srv1..srv4, serving LCG_FTS)
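A connection string like the one behind db_user@lcg_fts can be spelled out on the client side. Below is a minimal sketch assuming the cx_Oracle driver and the host and service names from the diagram; the descriptor is what a tnsnames.ora alias for lcg_fts would typically contain, with a placeholder password:

import cx_Oracle

# Full connect descriptor listing all virtual IPs. LOAD_BALANCE=on makes the
# client pick an address at random; FAILOVER=on makes it try the next address
# when the chosen one does not respond.
dsn = """(DESCRIPTION =
  (ADDRESS_LIST =
    (LOAD_BALANCE = on)
    (FAILOVER = on)
    (ADDRESS = (PROTOCOL = TCP)(HOST = srv1-v)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = srv2-v)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = srv3-v)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = srv4-v)(PORT = 1521))
  )
  (CONNECT_DATA = (SERVICE_NAME = lcg_fts)))"""

connection = cx_Oracle.connect("db_user", "db_password", dsn)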

Architecture (load balancing)
- Also used for rolling upgrades (patches applied node by node)
- Small glitches might happen during a VIP move: no response / timeout / error
- Applications need to be ready for this: catch errors, retry, do not hang (a retry sketch appears under "What applications should do" below)

$ sqlplus db_user@lcg_fts
(diagram: the srv2-v virtual IP has moved; listeners keep serving LCG_FTS on srv1, srv3 and srv4)

Oracle Streams
- Streams data to external databases (Tier-1 sites)
- Limited throughput, so it can serve only a few applications
- Creates a read-only copy of the DB
- The application can fail over to the copy

Oracle Data Guard
- Used as an on-disk backup
- A physical standby RAC with a small lag (a few hours)
- Can be opened read-only to recover from human errors
- Can be switched to primary mode as a fast disaster-recovery mechanism

DBA main concerns (based on experience)
- Human errors
  - By DBAs during administrative tasks: use and test procedures (not always an easy task)
  - By developers: restrict developers' access to the production DB
- Logical corruption / Oracle software bugs
  - e.g. data inserted into the wrong schemas
  - Patches are better tested on a pilot environment before deployment in production
- Oracle software security: quarterly security patches released by Oracle
- Increasing amount of stored data
  - Tapes are as slow as 5 years ago, so backups take longer
  - Move to backups on disk; prune old redundant data or summarize it

The database reality at CERN
- Databases for the world's biggest machine: a particle collider
- 18 database RACs (up to 8 nodes each): 124 servers, 150 disk arrays (more than 1700 disks)
- In other terms: 450 CPU cores, 900 GB of RAM, 550 TB of raw disk space(!)
- Connected to 10 Tier-1 sites for synchronized databases, sharing policies and procedures
- Team of 5 DBAs, plus a service coordinator and a link to the experiments
- 24x7 best-effort service for production RACs; maintenance without downtime within RAC features
- 0.02% service unavailability (average for 2008) = 1.75 hours/year
- 0.32% server unavailability (average for 2008) = 28 hours/year (patch deployment, broken hardware)

DNS Load Balancing
- Q: What is the IP address of application.cern.ch?
- A: application.cern.ch resolves to: node4.cern.ch, node1.cern.ch, node2.cern.ch, node3.cern.ch
- The client connects to node4.cern.ch

(diagram: client, DNS server, application cluster)

DNS Round Robin
- Allows basic load distribution

lxplus001 ~ > host lxplus.cern.ch
lxplus.cern.ch has address 137.138.4.171 (1)
lxplus.cern.ch has address 137.138.4.177 (2)
lxplus.cern.ch has address 137.138.4.178 (3)
lxplus.cern.ch has address 137.138.5.72 (4)
lxplus.cern.ch has address 137.138.4.169 (5)

lxplus001 ~ > host lxplus.cern.ch
lxplus.cern.ch has address 137.138.4.177 (2)
lxplus.cern.ch has address 137.138.4.178 (3)
lxplus.cern.ch has address 137.138.5.72 (4)
lxplus.cern.ch has address 137.138.4.169 (5)
lxplus.cern.ch has address 137.138.4.171 (1)

- No withdrawal of overloaded or failed nodes
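A minimal client-side sketch, standard library only, showing every address behind a round-robin name such as lxplus.cern.ch from the example:

import socket

def resolve_all(hostname):
    # Filtering on TCP/IPv4 gives one getaddrinfo entry per A record
    # the DNS server returned for this lookup.
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    return [sockaddr[0] for (_, _, _, _, sockaddr) in infos]

# Successive calls may return the address list in rotated order; a client
# that always takes the first entry still gets basic load distribution.
print(resolve_all("lxplus.cern.ch"))
print(resolve_all("lxplus.cern.ch"))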

DNS Load Balancing and Failover
- Requires an additional server: the arbiter
- The arbiter monitors the cluster members and adds or withdraws nodes as required
- Updates are transactional: a client never sees an empty list

host lxplus.cern.ch over three successive lookups (one column per lookup); the published set changes as nodes are added and withdrawn:

137.138.4.171   137.138.5.80    137.138.4.168
137.138.4.177   137.138.4.168   137.138.5.71
137.138.4.178   137.138.4.171   137.138.5.74
137.138.5.72    137.138.4.174   137.138.4.165
137.138.4.169   137.138.5.76    137.138.4.166

Application Load Balancing
- The arbiter polls a load metric from each cluster node over SNMP (in the example: node1: 24, node2: 48, node3: 35, node4: 27)
- It publishes the 2 best nodes for application.cern.ch (node1 and node4) through dynamic DNS updates
- Q: What is the IP address of application.cern.ch? A: application.cern.ch resolves to node4.cern.ch, node1.cern.ch; the client connects to node4.cern.ch
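A minimal sketch of the arbiter's selection logic; the SNMP poll and the dynamic DNS update are stubbed out as hypothetical functions, with the metric values taken from the slide:

def fetch_load_metrics():
    # Stub for the SNMP poll of every cluster member (values from the slide).
    return {"node1": 24, "node2": 48, "node3": 35, "node4": 27}

def best_nodes(metrics, count=2):
    # The `count` nodes with the lowest load metric.
    return sorted(metrics, key=metrics.get)[:count]

def publish_dns(alias, nodes):
    # Stub for the transactional dynamic-DNS update; never publish an empty list.
    assert nodes, "refuse to publish an empty node list"
    print(f"{alias} -> {nodes}")

publish_dns("application.cern.ch", best_nodes(fetch_load_metrics()))
# prints: application.cern.ch -> ['node1', 'node4']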

DB-centric Applications
- Development cycle
- What to expect
- What to do
- How to survive planned interventions

Apps and database release cycle
- Application release cycle: development service -> validation service -> production service
- Database software release cycle: production runs version 10.2.0.n, the validation service moves to 10.2.0.(n+1), then production follows to 10.2.0.(n+1)

What to expect
- Network glitches: any network failure
- Disconnects: idle time, network failure
- Failover: rolling upgrades, server failure
- Interventions, planned and unplanned: upgrades, patches

What applications should do
- Follow general guidelines for DB applications: primary keys, foreign keys, indexes, bind variables
- Reconnect: catch disconnect errors and reconnect before giving up with an error
- Handle errors: catch common DB errors; roll back the transaction and re-execute if needed
- Throttle retries: after 2-3 consecutive failures, wait before retrying
- Timeout calls: if the load is too high the DB might stop responding, and an error is better than no result, so put a timeout on database calls
A sketch pulling these guidelines together follows this list.
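A minimal sketch of the reconnect / handle-errors / throttle-retries / timeout guidelines, assuming the cx_Oracle driver (the callTimeout attribute needs cx_Oracle 7+ with a recent Oracle client); the connection details, query and table are hypothetical placeholders:

import time
import cx_Oracle

MAX_RETRIES = 3
BACKOFF_SECONDS = 10

def run_query(sql, params):
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            # A fresh connection per attempt: reconnecting also covers the
            # case where the previous session died during a VIP move.
            connection = cx_Oracle.connect("db_user", "db_password", "lcg_fts")
            connection.callTimeout = 30000  # ms; an error beats hanging forever
            try:
                cursor = connection.cursor()
                cursor.execute(sql, params)  # bind variables, not string building
                return cursor.fetchall()
            finally:
                connection.close()
        except cx_Oracle.DatabaseError:
            if attempt == MAX_RETRIES:
                raise  # give up and report the error instead of hanging
            # Throttle: wait before retrying so a struggling DB can recover.
            time.sleep(BACKOFF_SECONDS * attempt)

rows = run_query("SELECT status FROM transfers WHERE job_id = :id", {"id": 42})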

Survive problems and interventions
- Keep a cache/log where possible: the next few transfers, the results of the last or most common requests
- Create a queue on the server side so changes can be applied later
- Keep the last results as static content (for graphs, ...)
- Have a read-only copy of the DB somewhere (Oracle Streams is being tested at CERN for this); the application should be able to fail over and fail back
A sketch of a local write queue follows this list.
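A minimal sketch of such a local queue, assuming cx_Oracle for the real database and the standard-library sqlite3 module as the local spool; the spool file, table and statement are hypothetical placeholders:

import sqlite3
import cx_Oracle

spool = sqlite3.connect("pending_writes.db")
spool.execute("CREATE TABLE IF NOT EXISTS pending (sql TEXT, job_id INTEGER)")

def record_transfer(connection, job_id):
    sql = "INSERT INTO transfers (job_id) VALUES (:id)"
    try:
        connection.cursor().execute(sql, {"id": job_id})
        connection.commit()
    except cx_Oracle.DatabaseError:
        # DB unavailable: park the write locally and carry on.
        spool.execute("INSERT INTO pending VALUES (?, ?)", (sql, job_id))
        spool.commit()

def replay_pending(connection):
    # Fail-back: apply the queued writes once the database is reachable again.
    rows = spool.execute("SELECT rowid, sql, job_id FROM pending").fetchall()
    for rowid, sql, job_id in rows:
        connection.cursor().execute(sql, {"id": job_id})
        connection.commit()
        spool.execute("DELETE FROM pending WHERE rowid = ?", (rowid,))
        spool.commit()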

Check space constraints
- Space/disk is almost free, but fragmentation is still around: bad organisation can lead to slowness
- Keep the DB tidy: drop unnecessary data
- Archive old data (compressed, not indexed)
- Aggregate data (keep only summaries)
- Oracle partitioning is used at CERN for this; a pruning sketch follows this list
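A minimal sketch of partition-based pruning with cx_Oracle; the table and partition names are hypothetical, and the point is that dropping or archiving a whole partition is far cheaper than deleting row by row:

import cx_Oracle

connection = cx_Oracle.connect("db_user", "db_password", "lcg_fts")
cursor = connection.cursor()

# Assumes a log table range-partitioned by month, one partition per month.
cursor.execute("ALTER TABLE transfer_log DROP PARTITION logs_2007_01")

# Alternatively, keep the old rows by moving them to a compressed,
# unindexed archive table before dropping the partition.
cursor.execute("""
    INSERT INTO transfer_log_archive
    SELECT * FROM transfer_log PARTITION (logs_2007_02)
""")
connection.commit()
cursor.execute("ALTER TABLE transfer_log DROP PARTITION logs_2007_02")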

Monitoring and logging
- Monitor application responsiveness: the time to perform an operation
- Monitor database availability
- Log errors and check them; automate most of the error correction and pass the rest to operators
- Produce weekly reports on the application and database, or ask your DBAs: they make changes in behaviour easy to spot
A timing and logging sketch follows this list.
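A minimal sketch of such timing and error logging, using only the Python standard library; the wrapped query and table are hypothetical placeholders:

import functools
import logging
import time

logging.basicConfig(filename="db_operations.log", level=logging.INFO)

def timed(operation):
    # Decorator: log how long the wrapped DB call takes, and any error.
    @functools.wraps(operation)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            result = operation(*args, **kwargs)
            logging.info("%s ok in %.3fs", operation.__name__,
                         time.monotonic() - start)
            return result
        except Exception:
            logging.exception("%s failed after %.3fs", operation.__name__,
                              time.monotonic() - start)
            raise
    return wrapper

@timed
def fetch_transfer_status(connection, job_id):
    cursor = connection.cursor()
    cursor.execute("SELECT status FROM transfers WHERE job_id = :id", {"id": job_id})
    return cursor.fetchone()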

Summary
- Cluster the database using redundant hardware
- Split applications across database services; isolate them if needed
- Use Oracle Data Guard as an on-disk backup for fast recovery from disasters and human errors
- Keep the application as stateless as possible
- Use DNS load balancing with an arbiter to get the best servers
- The application should aim to keep running even while the DB is unavailable (local caches/queues)
- Predict future space usage (and clean up)
- Monitor application usage, errors and timings
- Involve your DBA as much as possible

Questions and answers.

References:
- CERN Physics Databases wiki (general advice, connection management): http://cern.ch/phydb/wiki
- WLCG Service Reliability Workshop (Nov 2007), DNS load balancing (slides from Vlado Bahyl) and case studies: http://indico.cern.ch/conferenceotherviews.py?confid=20080
- WLCG Collaboration Workshop (Tier0/Tier1/Tier2) (Apr 2008): http://indico.cern.ch/conferenceotherviews.py?confid=6552