Monitoring and Managing a JVM
Erik Brakkee & Peter van den Berkmortel
Overview
  About Axxerion
  Challenges and example
  Troubleshooting
  Memory management
  Tooling
  Best practices
  Conclusion
About Axxerion
  Axxerion is an Integrated Workplace Management System that aims to:
    make organizations more efficient
    enable collaboration
    adapt to an organization
  As an organization, Axxerion aims to offer employees a stable and innovative workplace
Axxerion Metrics
  10 developers
  30 consultants
  100 virtual and physical servers
  14,000 monitored items
  62 items per second
  6 clusters
  300+ clients in 14 countries
  80,000 users
Middleware Stack
  CentOS 6 & 7
  KVM virtualization (standard Linux)
  MySQL 5.6
  JBoss 5
  Java EE 5
  Java 8
  EclipseLink (JPA 2)
Challenges
  Multi-tenancy
    Trace back issues
  Load
    Some issues only occur under load
    Difficult to reproduce in test
    After-the-fact troubleshooting
  Monitoring many variables
    Is there a problem?
    Distinguish between cause and effect
    Proactive
    Self-defeating monitoring
Example of an Incident
  One year of troubleshooting
  Crashes in native code
    Replace native libraries
  Application logs freeze
    Automatic reload of log configuration
  Heap space problems
    Finalizer queue size monitoring
    Introduced our own log appender
  Server generates 25 million exceptions per hour
    Introduced exception log analysis
Example of an Incident
  One year of troubleshooting
  Server log fragments end up in unexpected places
  File descriptor errors
  Final solution found through monitoring and reverse thinking
  All of this was caused by an object cloner
Troubleshooting
  After-the-fact
    Which actions led to a problem?
    Users do not remember what they were doing
    Therefore, we log a lot
      Every login & logout
      Every user action
      Every update of a field of an object
      Anything else that can be useful
Troubleshooting
  Event logging
    Start and end statements
    At one-minute intervals
    Number of bytes allocated
    CPU usage
    Allows troubleshooting of individual threads
Troubleshooting
  Event logging
    2015-11-03 13:29:23,509 FINE [com.axxerion.performance] (rp-worker-3821) performance @id 395972 @name rp-worker-3821 @usertime 0 @cputime 231809 @allocated 1336 @state RUNNABLE @blockedcount 0 @blockedtime -1 @waitedcount 0 @waitedtime -1 @lockinfo null @subevent workertask.start @realtime 9615399862973957
    2015-11-03 13:29:23,559 FINE [com.axxerion.performance] (rp-worker-3821) performance @id 395972 @name rp-worker-3821 @usertime 10000000 @cputime 20074642 @allocated 2361888 @state RUNNABLE @blockedcount 11 @blockedtime -1 @waitedcount 1 @waitedtime -1 @lockinfo null @subevent workertask.end @realtime 9615399912328419
    (based on ThreadMXBean)
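The slide only says these records are based on ThreadMXBean. The following is a minimal sketch of how such a line could be produced for the calling thread, not the speakers' actual code. The class name ThreadPerformanceSample, the method sample, and the use of System.nanoTime() for @realtime are assumptions made for illustration. Per-thread allocated bytes come from the HotSpot-specific com.sun.management.ThreadMXBean, so the sketch falls back to -1 where that extension is unavailable.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper for illustration; the names are not taken from the talk.
final class ThreadPerformanceSample {

    // Builds one "performance" record for the calling thread.
    static String sample(String subEvent) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long id = Thread.currentThread().getId();
        ThreadInfo info = threads.getThreadInfo(id);

        // CPU and user time are reported in nanoseconds; -1 means unsupported or disabled.
        long cpuTime = -1;
        long userTime = -1;
        if (threads.isCurrentThreadCpuTimeSupported()) {
            cpuTime = threads.getCurrentThreadCpuTime();
            userTime = threads.getCurrentThreadUserTime();
        }

        // Per-thread allocated bytes is a HotSpot extension (com.sun.management).
        long allocated = -1;
        if (threads instanceof com.sun.management.ThreadMXBean) {
            com.sun.management.ThreadMXBean hotspot = (com.sun.management.ThreadMXBean) threads;
            if (hotspot.isThreadAllocatedMemorySupported() && hotspot.isThreadAllocatedMemoryEnabled()) {
                allocated = hotspot.getThreadAllocatedBytes(id);
            }
        }

        // Blocked and waited times stay at -1 unless thread contention monitoring is enabled.
        // Assumption: @realtime is a System.nanoTime() reading.
        return String.format(
            "performance @id %d @name %s @usertime %d @cputime %d @allocated %d @state %s"
                + " @blockedcount %d @blockedtime %d @waitedcount %d @waitedtime %d"
                + " @lockinfo %s @subevent %s @realtime %d",
            id, info.getThreadName(), userTime, cpuTime, allocated, info.getThreadState(),
            info.getBlockedCount(), info.getBlockedTime(), info.getWaitedCount(),
            info.getWaitedTime(), info.getLockInfo(), subEvent, System.nanoTime());
    }

    public static void main(String[] args) {
        System.out.println(sample("workertask.start"));
        System.out.println(sample("workertask.end"));
    }
}
```

Subtracting the start record from the end record of the same thread gives the CPU time and allocation attributable to one task, which is what makes per-thread troubleshooting possible.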
Troubleshooting
  Exception logging
    Bulk approach
      Exceptions count as similar when their stack frames match
      Ignore the message
    Custom log appender/handler
      Periodically write hash data
      Use hash as key
      Store hash on disk with full (first seen) exception
Troubleshooting
  com.axxerion.fault: error_no_administrator_defined: client axdsr1
      at com.axxerion.server.b.b.execute(directmessagequeueentry.java:180)
      at com.axxerion.server.b.b.execute(directmessagequeueentry.java:31)
      at com.axxerion.r.execute(runnableexecutablewrapper.java:38)
      10 more
  Caused by: Fault: error_no_administrator_defined: client axdsr1
      at com.axxerion.server.util.contactutil.getadministratorsystemuserid(contactutil.java:384)
      ... 8 more
  Git principle: content-addressable storage (hash == object)
  Secure hash based on exception class, stack frames, and cause of exception (recursively)
  Log the hash in the exception.log file
  A simple script can add up the occurrences per hash
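The hashing rule above (exception class, stack frames, and the cause chain, with the message ignored) can be sketched as follows. This is an illustration under assumptions rather than the speakers' implementation. The class name ExceptionFingerprint is hypothetical, SHA-1 is assumed only because the hashes shown in the talk are 40 hex digits long, and the depth limit on the cause chain is an added safeguard.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; the names are not taken from the talk.
final class ExceptionFingerprint {

    // Assumed bound so that a cyclic cause chain cannot loop forever.
    private static final int MAX_CAUSE_DEPTH = 50;

    // Returns a hex digest that ignores exception messages, so that
    // "client axdsr1" and "client xyz" variants map to the same key.
    static String hash(Throwable throwable) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-1"); // assumption: 40 hex digits as on the slide
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        int depth = 0;
        for (Throwable current = throwable;
             current != null && depth < MAX_CAUSE_DEPTH;
             current = current.getCause(), depth++) {
            update(digest, current.getClass().getName());
            for (StackTraceElement frame : current.getStackTrace()) {
                update(digest, frame.getClassName() + "." + frame.getMethodName()
                        + ":" + frame.getLineNumber());
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    private static void update(MessageDigest digest, String part) {
        digest.update(part.getBytes(StandardCharsets.UTF_8));
        digest.update((byte) '\n'); // separator so adjacent parts cannot run together
    }
}
```

A log appender could then write only the hash on every occurrence and store the full stack trace once, the first time a hash is seen.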
Troubleshooting
  Script output
    Processing /var/ /log/exceptions.log...
    Most occurring exceptions
    Occurrences: 1647
    Hash: 2d4eed091276af79d526e88115f68a82bbb0c1de
    First exception of this kind seen:
    Time: 2015-09-28 23:30:20.954301678 +0200
    Level: INFO
    Log file sample
      2015-11-02 16:28:16,184 INFO [com.axxerion.exceptiontracker] stats 2d4eed091276af79d526e88115f68a82bbb0c1de INFO occurrences delta 2 cumulative 1101
    xyz.svc.webservices.data.serviceresponseexception: The specified object was not found
        at xyz.svc.ws.data.serviceresponse.internalthrow(serviceresponse.java:266)
        at xyz.svc.ws.data.request.execute(request.java:152)
        at xyz.svc.ws.data.svcservice.lookupitems(service.java:1364)
        at xyz.svc.ws.data.svcservice.bindtoitem(service.java:1407)
    Total number of unique exceptions: 160
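The counting step of such a script can be sketched in a few lines. The talk does not show the script itself, so this is an assumed reconstruction in Java: the class name ExceptionStats and the regular expression are hypothetical and are modelled only on the single "stats" log line shown above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the names are not taken from the talk.
final class ExceptionStats {

    // Assumed line shape: "... stats <40-hex-hash> <LEVEL> occurrences delta <n> cumulative <m>"
    private static final Pattern STATS = Pattern.compile(
        "stats ([0-9a-f]{40}) \\w+ occurrences delta (\\d+) cumulative \\d+");

    // Sums the "delta" values per hash; lines that do not match are skipped.
    static Map<String, Long> sumDeltas(List<String> lines) {
        Map<String, Long> totals = new HashMap<>();
        for (String line : lines) {
            Matcher matcher = STATS.matcher(line);
            if (matcher.find()) {
                totals.merge(matcher.group(1), Long.parseLong(matcher.group(2)), Long::sum);
            }
        }
        return totals;
    }

    // Usage: pass the path of the exception log as the first argument.
    public static void main(String[] args) throws IOException {
        Map<String, Long> totals = sumDeltas(Files.readAllLines(Paths.get(args[0])));
        totals.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .forEach(e -> System.out.println("Occurrences: " + e.getValue() + " Hash: " + e.getKey()));
        System.out.println("Total number of unique exceptions: " + totals.size());
    }
}
```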
Troubleshooting
  Troubleshooting experience
    Ad-hoc scripting
    Some issues take days, others might take years
    Issues are getting harder to find
  Psychology
    Relax
    Switch to stand-by
    Gather data
      This might be your only chance
    Get out of the denial phase
    Reverse thinking
Memory Management
  Some essential tips
    GC logs
    Heap usage
    Heap fragmentation
    Initiating occupancy fraction
    Maximum chunk size
    Trigger full GCs
      jmap -histo:live
Monitoring
  Identify core parameters
    User experience
      Response time
      Outstanding requests
    Technical
      Multiple full GCs within a short time frame
      Failing scheduled tasks (e.g. backups, restores)
  See also techblog.netflix.com for several ideas and tooling around monitoring.
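The "multiple full GCs within a short time frame" trigger can be expressed as a sliding-window check. This is a minimal sketch under assumptions: the class name FullGcBurstDetector, the threshold, and the window length are hypothetical, and the talk's actual checks run in Zabbix. The collector name passed to collectionCount depends on the garbage collector in use and can be listed via ManagementFactory.getGarbageCollectorMXBeans().

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper for illustration; the names are not taken from the talk.
final class FullGcBurstDetector {
    private final int threshold;
    private final long windowMillis;
    private final Deque<Long> events = new ArrayDeque<>();

    FullGcBurstDetector(int threshold, long windowMillis) {
        this.threshold = threshold;
        this.windowMillis = windowMillis;
    }

    // Records one full GC and reports whether at least `threshold`
    // full GCs now fall within the last `windowMillis` milliseconds.
    boolean record(long timestampMillis) {
        events.addLast(timestampMillis);
        while (timestampMillis - events.peekFirst() > windowMillis) {
            events.removeFirst(); // drop events that left the window
        }
        return events.size() >= threshold;
    }

    // Total collections reported by the named collector, or -1 if no collector has that name.
    // A poller would call this periodically and call record() once per increment.
    static long collectionCount(String collectorName) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (gc.getName().equals(collectorName)) {
                return gc.getCollectionCount();
            }
        }
        return -1;
    }
}
```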
Tooling
  Zabbix
    Flexible
    Easily customizable
    Monitors large numbers of servers
Tooling
  Technical approach
    Middleware, OS, Database
      Use standard items
      Use your own scripts
    Application level
      Use JMX as much as possible
      Internal statistics service
Tooling
  jstack
    Annotate and sort threads with a script
    Deadlock detection and debugging
    Stack dump (jstack -l) has negligible disruption
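The talk uses jstack -l for deadlock detection. The same information is also available in-process through the standard ThreadMXBean, which fits the "use JMX as much as possible" approach. This is a minimal sketch, not the speakers' tooling, and the class name DeadlockCheck is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper for illustration; the names are not taken from the talk.
final class DeadlockCheck {

    // Returns a description of all deadlocked threads, or an empty string when there are none.
    static String report() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no deadlock exists
        if (ids == null) {
            return "";
        }
        StringBuilder report = new StringBuilder();
        // Request locked monitors and synchronizers, similar to what jstack -l prints.
        for (ThreadInfo info : threads.getThreadInfo(ids, true, true)) {
            if (info != null) {
                report.append(info);
            }
        }
        return report.toString();
    }

    public static void main(String[] args) {
        String report = report();
        System.out.println(report.isEmpty() ? "no deadlock detected" : report);
    }
}
```

Note that ThreadInfo.toString() prints only a limited number of stack frames, so a full jstack dump remains the better source for detailed debugging.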
Tooling
  jmap
    Heap dumps
      jmap -dump:format=b,file=$dumpfile $pid: stop-the-world up to 3 minutes
      After disaster
    Histogram
      jmap -histo: negligible disruption
      After-the-fact analysis
    Trigger full GC
      jmap -histo:live: stop-the-world up to 30 seconds
      Avoid heap fragmentation
Tooling
  Java VisualVM
    MBean monitoring
    CPU sampling (5s interval)
    Do not use
      Memory sampling
      Profiling
Tooling
  Eclipse Memory Analyzer
    Works with 17 GB memory dumps
    Has saved the day
    Run on a separate machine
Wishlist
  Missing JVM features
    Stop running threads
    Deallocated bytes per thread
    Metaspace GC can trigger full GC
  Garbage collection
    Compacting
    Predictable performance
    Guaranteed no stop-the-world
Best Practices
  Simple statements can kill your server
    System.getProperty( )
    exception.toString()
    InetAddress.getLocalHost().getHostName()
  Third-party libraries
    Reflection bottleneck
  Risk management
    Runtime control over new features
    Ensure that every loop or recursion terminates
  Do not use finalizers
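The slide does not say why these statements are dangerous. Two facts about Java 8 make them plausible hot spots under load: system properties live in a java.util.Properties object, which extends Hashtable and synchronizes its accessors, and InetAddress.getLocalHost() may trigger a name service lookup. One common mitigation is to resolve such values once and reuse them. This is a minimal sketch, assuming the values do not change while the server runs. The class name CachedEnvironment and the fallback value "unknown" are hypothetical.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for illustration; the names are not taken from the talk.
final class CachedEnvironment {

    // Read once at class initialization instead of on every request.
    static final String LINE_SEPARATOR = System.getProperty("line.separator");

    // Resolved once; later calls read a plain field and never touch the resolver.
    static final String HOST_NAME = resolveHostName();

    private static String resolveHostName() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "unknown"; // assumed fallback so that startup does not fail
        }
    }

    public static void main(String[] args) {
        System.out.println("host=" + HOST_NAME);
    }
}
```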
Conclusion
  Results
    Server stability (99.998% uptime)
    Getting better
  Troubleshooting
    A lot of tools required
    Do not assume anything
    Go top-down
    Sometimes there is more than one issue
    Cause or effect
  There is no silver bullet
techblog.axxerion.com
www.axxerion.com/nl/careers/
Please rate my talk in the official J-Fall app