BUILDING HIGH-AVAILABILITY SERVICES IN JAVA MATTHIAS BRÄGER CERN GS-ASE Matthias.Braeger@cern.ch
AGENDA Measuring service availability Java Messaging Shared memory solutions Deployment Examples Summary 2
WHAT IS HIGH AVAILABILITY? 3
AVAILABILITY Failures happen! How do you build reliable systems regardless? How do you provide continuous, uninterrupted service? 4
THE USS YORKTOWN BUG 5
HUGE NEEDS FOR HA SYSTEMS 6
MEASURING SERVICE AVAILABILITY 7
CALCULATING AVAILABILITY Availability is usually expressed in percentage of uptime in a given year Uptime and availability are not synonymous! Example: A system can be up, but not available, as in the case of a network outage. The impact of unavailability varies with its time of occurrence 8
SCHEDULED AND UNSCHEDULED DOWNTIME (1/2) Scheduled downtime: Result of some logical, management-initiated event Examples: Patches to the system software that require reboot System configuration changes that require reboot 9
SCHEDULED AND UNSCHEDULED DOWNTIME (2/2) Unscheduled downtime: Usually arises from some physical event Examples: Hardware failure (power outages, failed CPU or RAM components, etc.) Software failure (application, middleware and operating system failures) Environmental anomaly (over-temperature related shutdown, logically or physically severed network connections, catastrophic security breaches) 10
CLASS OF NINES
Availability %               Downtime per year   Downtime per month (30 days)   Downtime per week
90% ("one nine")             36.5 days           72 hours                       16.8 hours
99% ("two nines")            3.65 days           7.20 hours                     1.68 hours
99.5%                        1.83 days           3.60 hours                     50.4 minutes
99.9% ("three nines")        8.76 hours          43.2 minutes                   10.1 minutes
99.95%                       4.38 hours          21.6 minutes                   5.04 minutes
99.99% ("four nines")        52.56 minutes       4.32 minutes                   1.01 minutes
99.999% ("five nines")       5.26 minutes        25.9 seconds                   6.05 seconds
99.9999% ("six nines")       31.5 seconds        2.59 seconds                   0.605 seconds
99.99999% ("seven nines")    3.15 seconds        0.259 seconds                  0.0605 seconds
11
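The figures in the table follow from one multiplication: allowed downtime = (100% - availability) x period. A minimal Java sketch of that arithmetic is given below; the class and method names (DowntimeBudget, allowedDowntime) are illustrative and not part of the original slides, and a year is assumed to have 365 days.

```java
import java.time.Duration;

/** Converts an availability percentage into the downtime it permits (illustrative helper). */
final class DowntimeBudget {

    /** Returns the downtime allowed within the given period, rounded to milliseconds. */
    static Duration allowedDowntime(double availabilityPercent, Duration period) {
        if (availabilityPercent < 0.0 || availabilityPercent > 100.0) {
            throw new IllegalArgumentException("availability must be within 0..100");
        }
        double unavailableFraction = (100.0 - availabilityPercent) / 100.0;
        // Rounding absorbs the small floating-point error of the subtraction above.
        long millis = Math.round(period.toMillis() * unavailableFraction);
        return Duration.ofMillis(millis);
    }

    public static void main(String[] args) {
        Duration year = Duration.ofDays(365); // assumption: non-leap year
        Duration day = Duration.ofDays(1);
        System.out.println("99.9%   per year: " + allowedDowntime(99.9, year));
        System.out.println("99.999% per year: " + allowedDowntime(99.999, year));
        System.out.println("99.999% per day:  " + allowedDowntime(99.999, day));
    }
}
```

For "five nines" the daily budget comes out at 864 milliseconds, which is why manual intervention cannot be part of the recovery path at that level.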
AVAILABILITY ENVIRONMENT CLASSIFICATION (AEC)
HRG* Class   Indication          Availability   Description
AEC-0        Conventional        -              Service can be interrupted, data integrity is not essential
AEC-1        Highly Reliable     99%            Service can be interrupted, data integrity must be assured
AEC-2        High Availability   99.9%          Service is only allowed to be interrupted within scheduled time windows or minimally at main runtime
AEC-3        Fault Resilient     99.99%         Service must be assured without any downtime within well defined time windows or at main runtime
AEC-4        Fault Tolerant      99.999%        Service must be guaranteed without interruption, 24/7 service must be assured
AEC-5        Disaster Tolerant   99.9999%       Service must be available under all circumstances
* Introduced by the Harvard Research Group (HRG) 12
REASONS FOR UNAVAILABILITY OF ENTERPRISE IT SYSTEMS Lack of best practice: 1. change control 2. monitoring of the relevant components 3. requirements and procurement 4. operations 5. avoidance of network failures 6. avoidance of internal application failures 7. avoidance of external services that fail 8. physical environment 9. network redundancy 10. technical solution of backup 11. process solution of backup 12. physical location 13. infrastructure redundancy 14. storage architecture redundancy (From a survey among academic availability experts in 2010) 13
REACHING HIGH-AVAILABILITY High availability implies no human intervention to restore operation in complex systems. Example: An availability limit of 99.999% allows less than one second (0.864 s) of downtime per day. The need for human intervention for maintenance actions in a large system will exceed this limit. 14
REACHING HIGH-AVAILABILITY Avoid Single-Point-of-Failure risks Redundancy of system critical components Passive redundancy, e.g. boat with two separate engines Active redundancy, e.g. Internet routing Fault-tolerance and robustness of the overall system Exhaustive testing before going into operation! Quickly reachable experts Good error messages and quick communication system Enough hardware spare-parts 15
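As a rough illustration of redundancy without human intervention, the sketch below tries a list of replicas in order and falls over to the next one when a call fails. It is only one possible shape of such a mechanism, not something taken from the slides; the names FailoverCall and callWithFailover are hypothetical, and each replica is modelled as a plain java.util.function.Supplier.

```java
import java.util.List;
import java.util.function.Supplier;

/** Tries redundant replicas in order until one of them answers (illustrative sketch). */
final class FailoverCall {

    /** Returns the first successful result; throws only if every replica failed. */
    static <T> T callWithFailover(List<Supplier<T>> replicas) {
        RuntimeException lastFailure = null;
        for (Supplier<T> replica : replicas) {
            try {
                return replica.get();
            } catch (RuntimeException e) {
                lastFailure = e; // remember the failure and fall over to the next replica
            }
        }
        throw new IllegalStateException("all replicas failed", lastFailure);
    }

    public static void main(String[] args) {
        Supplier<String> primary = () -> { throw new IllegalStateException("primary down"); };
        Supplier<String> standby = () -> "answer from standby";
        System.out.println(callWithFailover(List.of(primary, standby)));
    }
}
```

The caller never learns that the primary was down, which is the point: the switch to the standby happens inside the system rather than through an operator.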
SERVICE LEVEL AGREEMENTS (SLA) SLAs are used to define the availability of a given service. Many systems have to be available 24/7 but some need high-availability only within certain time windows Example: The trading system of a stock market does not need to be available on weekends or bank holidays. 16
JAVA MESSAGING 17
WHAT IS MESSAGING? Method of communication between software components or applications Messaging enables distributed communication that is loosely coupled Anonymous communication Sender and the receiver do not have to be available at the same time 18
WHAT IS THE JMS API? The Java Message Service is a Java API that allows applications to create, send, receive, and read messages Loosely coupled Asynchronous Reliable 19
WHEN CAN YOU USE JMS? The provider wants the components not to depend on information about other components' interfaces The provider wants the application to run whether or not all components are up and running simultaneously. The application business model allows a component to send information to another and to continue to operate without receiving an immediate response. 20
JMS TECHNICAL TERMS Brokers: A JMS broker provides clients with connectivity, and message storage/delivery functions. Messages: A message is an object that contains the required header fields, optional properties, and data payload being transferred between JMS clients. Destinations: Destinations are maintained by the message broker. They can be either queues or topics. 21
MESSAGING MODELS (1/2) Point-to-Point Messaging Each message has only one consumer A sender and a receiver of a message have no timing dependencies The receiver acknowledges the successful processing of a message JMS allows messages to expire 22
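The two core properties of the point-to-point model (no timing dependency between sender and receiver, and exactly one consumer per message) can be illustrated inside a single JVM with a plain queue. This is a stand-in built on java.util.concurrent, not the JMS API; the class PointToPointSketch is hypothetical, and acknowledgement and message expiry are deliberately not modelled.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** In-JVM stand-in for a message queue; a real system would use a JMS broker instead. */
final class PointToPointSketch {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    /** The sender does not need a receiver to be present: the message simply waits. */
    void send(String message) {
        queue.add(message);
    }

    /** Removes and returns the next message, or null if none is waiting. */
    String receive() {
        return queue.poll();
    }

    public static void main(String[] args) {
        PointToPointSketch destination = new PointToPointSketch();
        destination.send("alarm-1");                // sent before any receiver is active
        System.out.println(destination.receive()); // first receiver consumes the message
        System.out.println(destination.receive()); // nothing left for a second receiver
    }
}
```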
MESSAGING MODELS (2/2) Publish/Subscribe Messaging Supports publishing messages to a particular message topic Neither the publisher nor the subscriber knows about each other Each message can have multiple consumers A client that subscribes to a topic can consume only messages published after the client has created a subscription 23
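The publish/subscribe behaviour described above can be sketched in the same in-JVM style. Again this is not the JMS API: TopicSketch is a hypothetical class in which subscribers are plain java.util.function.Consumer callbacks. It shows that every subscriber receives each message and that a subscriber only sees messages published after it subscribed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** In-JVM stand-in for a topic; publisher and subscribers only know the topic, not each other. */
final class TopicSketch {
    private final List<Consumer<String>> subscribers = new CopyOnWriteArrayList<>();

    void subscribe(Consumer<String> subscriber) {
        subscribers.add(subscriber);
    }

    /** Delivers the message to every subscriber registered at this moment. */
    void publish(String message) {
        for (Consumer<String> subscriber : subscribers) {
            subscriber.accept(message);
        }
    }

    public static void main(String[] args) {
        TopicSketch topic = new TopicSketch();
        List<String> earlySubscriber = new ArrayList<>();
        List<String> lateSubscriber = new ArrayList<>();

        topic.subscribe(earlySubscriber::add);
        topic.publish("value-1");
        topic.subscribe(lateSubscriber::add); // subscribes after value-1 was published
        topic.publish("value-2");

        System.out.println("early: " + earlySubscriber);
        System.out.println("late:  " + lateSubscriber);
    }
}
```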
METHODS FOR DECREASING COUPLING Communication objects need to be serialized before sending and deserialized after receiving How do I avoid unneeded client restarts when the communication object changes? Problem: Older versions of an application would throw exceptions when asked to deserialize new versions of the old object type. Newer versions of an application would throw exceptions when deserializing older versions of a type with missing data. Solution: Java serialized objects: Always define the serialVersionUID or use XML or JSON for messaging! Version tolerant and easier to handle 24
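A minimal sketch of the first option, using only java.io: the message class pins its serialVersionUID explicitly instead of letting the JVM derive one from the class layout. With a fixed value, compatible changes such as adding a field do not by themselves make older byte streams unreadable. AlarmEvent and SerializationSketch are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Example communication object with an explicitly pinned serialVersionUID. */
final class AlarmEvent implements Serializable {
    // Keep this value stable across releases as long as changes stay compatible.
    private static final long serialVersionUID = 1L;

    final String source;
    final int level;

    AlarmEvent(String source, int level) {
        this.source = source;
        this.level = level;
    }
}

final class SerializationSketch {

    /** What a sender does before handing the object to the messaging layer. */
    static byte[] serialize(Serializable object) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** What a receiver does after taking the bytes from the messaging layer. */
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        AlarmEvent copy = (AlarmEvent) deserialize(serialize(new AlarmEvent("pump-7", 2)));
        System.out.println(copy.source + " level " + copy.level);
    }
}
```

If the serialVersionUID is omitted, the JVM computes one from the class structure, so even a harmless change can cause an InvalidClassException on clients that still run the old class.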
FREE JMS DISTRIBUTIONS Apache ActiveMQ Open source, well documented Provides API for different languages (Java, C++, Python, ...) Apache Apollo ActiveMQ's next generation of messaging OpenJMS OpenMQ by Oracle StormMQ, cloud solution 25
SHARED MEMORY (DISTRIBUTED CACHING) 26
DEFINITION In computing, shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient way of passing data Shared memory can be used to realize load balanced, redundant systems 27
IN-MEMORY DATABASES (IMDB) IMDB is a database management system that primarily relies on main memory Faster than disk-optimized databases Use cases: Applications where response time is critical Independence from the reference database 28
WHY SHARED MEMORY? Share data/state among many servers (e.g. web session sharing) Cache data (distributed cache) Cluster applications Provide secure communication among servers Distribute workload onto many servers Take advantage of parallel processing Provide fail-safe data management 29
SHARED MEMORY PRODUCTS FOR JAVA Hazelcast Peer-to-peer solution Based on java.util.{Queue, Set, List, Map} Community edition available Terracotta Scalable array of in-memory cache servers Based on Java caching standard Ehcache Allows caching over JVM memory limits Free version with limited functionalities available Memcached Free & open source, designed for dynamic web applications Simple solution for read-only use cases, but not designed for parallel read-write access Memcached server is atomic and not aware of other servers → no automatic failover Other Proprietary Solutions Oracle Coherence, JCache compliant SAP Hana 30
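Because products such as Hazelcast expose their distributed structures through the standard java.util collection interfaces, application code can be written against java.util.concurrent.ConcurrentMap and stay independent of the product. The sketch below assumes this style; SessionStore is a hypothetical class, and a local ConcurrentHashMap stands in for the distributed map so that the example runs without any cluster.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Session registry written against ConcurrentMap so the backing map can be local or distributed. */
final class SessionStore {
    private final ConcurrentMap<String, String> sessions;

    SessionStore(ConcurrentMap<String, String> sessions) {
        this.sessions = sessions;
    }

    /** Registers the session unless it already exists; returns the user that owns it. */
    String login(String sessionId, String user) {
        String existing = sessions.putIfAbsent(sessionId, user); // atomic check-and-insert
        return existing != null ? existing : user;
    }

    /** Returns the user bound to the session, or null if the session is unknown. */
    String userFor(String sessionId) {
        return sessions.get(sessionId);
    }

    public static void main(String[] args) {
        // Assumption: in a cluster this map would be provided by the shared-memory product.
        ConcurrentMap<String, String> shared = new ConcurrentHashMap<>();
        SessionStore server1 = new SessionStore(shared);
        SessionStore server2 = new SessionStore(shared);

        server1.login("s-42", "alice");
        System.out.println(server2.userFor("s-42")); // session created on server1 is visible on server2
    }
}
```

Keeping the state in the shared map rather than in either server object is what lets a second server take over a session when the first one fails.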
MEMCACHED EXAMPLE 31
TERRACOTTA ARCHITECTURE 32
EXAMPLES 33
SCENARIO 1: SIMPLE MONITORING Requirements: moderate data size, high throughput, short maintenance stops, availability not critical, low budget. [Diagram: clients, JMS broker, SERVER, JMS broker (same or different brokers), DAQ processes] 34
SCENARIO 2: HIGH AVAILABILITY MONITORING Requirements: moderate data size, average throughput, minimal service interruptions, high availability, low budget. [Diagram: clients, JMS broker and standby JMS broker, SERVER 1 and SERVER 2 with Terracotta, clustered JMS brokers, DAQ processes] 35
SCENARIO 3: BIG DATA MONITORING Requirements: large data set, high throughput, minimal service interruptions, high availability. [Diagram: clients, JMS brokers 1..n, SERVER 1..m with Terracotta server array, JMS brokers 1..k, DAQ processes] 36
SCENARIO 4: DISTRIBUTED STATELESS SYSTEM Characteristics: stateless (mirrored) daemons, anonymous, asynchronous communication. [Diagram: clients 1..k, JMS brokers 1..n, daemons 1..m] 37
SUMMARY WHAT DID WE LEARN? 38
SUMMARY Service availability Needs to be well defined within a Service Level Agreement Measuring is non-trivial and has to take the SLA into account High availability implies no human intervention to restore operation in complex systems JMS Provides anonymous, reliable messaging Suitable middleware for high-availability services Shared memory Simultaneously accessed by multiple programs (cluster) Can be used to realize In-Memory databases Allows realization of parallel processing 39
QUESTIONS? THANK YOU FOR YOUR ATTENTION! Matthias.Braeger@cern.ch 40