Enabling Java in Latency Sensitive Environments




Enabling Java in Latency Sensitive Environments
Matt Schuetze, Azul Director of Product Management
Matt Schuetze, Product Manager, Azul Systems
Utah JUG, Murray UT, November 20, 2014
Austin Java Users Group, Austin, Texas
@azulsystems

High Level Agenda
Welcome to all Austin JUG members!
Intro, jitter vs. JITTER
Java in a low latency application world
The (historical) fundamental problems
What people have done to try to get around them
What if the fundamental problems were eliminated?
What 2015 looks like for Low Latency Java developers
Real World Case Studies


Is jitter a proper word for this?
Answer: no, it's not jitter at all. It's phase changes.
99%'ile is ~60 usec
Max is ~30,000% higher than typical

About Azul Systems
Zing, Zulu, and everything about Java Virtual Machines
We make scalable Virtual Machines
Have built whatever it takes to get job done since 2002
3 generations of custom SMP Multi-core HW (Vega)
Now Pure software for commodity x86 (Zing)
Certified OpenJDK (Zulu)
Known for Low Latency, Consistent execution, and Large data set excellence
[Images: C4, Vega]

Java in the low latency world

Java in a low latency world
Yep, low latency Java is goin' down for real
Why do people use Java for low latency apps? Are they crazy?
No. There are good, easy to articulate reasons:
  Projected lifetime cost
  Developer productivity
  Time-to-product, Time-to-market, ...
  Leverage, ecosystem, ability to hire

e.g. customer answer to: Why do you use Java in Algo Trading?
Strategies have a shelf life
We have to keep developing and deploying new ones
Only one out of N is actually productive
Profitability therefore depends on ability to successfully deploy new strategies, and on the cost of doing so
Our developers seem to be able to produce 2x-3x as much when using a Java environment as they would with C++...

So what is the problem? Is Java Slow?
No
  A good programmer will get roughly the same speed from both Java and C++
  A bad programmer won't get you fast code on either
  The 50%'ile and 90%'ile are typically excellent...
It's those pesky occasional stutters and stammers and stalls that are the problem...
Ever hear of Garbage Collection?

Java's Achilles heel

Stop-The-World Garbage Collection: How bad is it?
Let's ignore the bad multi-second pauses for now...
Low latency applications regularly experience small, minor GC events that range in the 10s of msec
  Frequency directly related to allocation rate
  In turn, directly related to throughput
So we have great 50%, 90%. Maybe even 99%
But 99.9%, 99.99%, Max, all suck
So bad that it affects risk, profitability, service expectations, etc.
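The link between allocation rate and minor GC frequency can be made concrete with simple arithmetic. The sketch below assumes a deliberately simplified model in which a newgen collection runs each time the allocation area (Eden) fills up; the class name, method name, and sample figures are hypothetical and not taken from the talk.

```java
// Simplified model (an assumption): a newgen collection runs whenever Eden fills.
final class NewGenPauseEstimate {

    /** Seconds between newgen collections for a given Eden size and allocation rate. */
    static double secondsBetweenCollections(double edenMegabytes, double allocMegabytesPerSecond) {
        if (allocMegabytesPerSecond <= 0) {
            throw new IllegalArgumentException("allocation rate must be positive");
        }
        return edenMegabytes / allocMegabytesPerSecond;
    }

    public static void main(String[] args) {
        double edenMb = 1024.0; // hypothetical 1 GB Eden
        for (double rate : new double[] {100.0, 200.0, 400.0}) { // hypothetical MB/s
            System.out.printf("%.0f MB/s -> one newgen pause every %.2f s%n",
                    rate, secondsBetweenCollections(edenMb, rate));
        }
    }
}
```

Under this model, doubling the allocation rate (which tends to follow throughput) halves the time between pauses.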

STW-GC effects in a low latency application
99%'ile is ~60 usec
Max is ~30,000% higher than typical
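To see how a good 99%'ile can coexist with a terrible max, it helps to compute both from the same samples. The following is a minimal sketch using the nearest-rank percentile definition; the class name and the sample data are hypothetical, chosen only to mimic the shape of the chart.

```java
import java.util.Arrays;

// Illustrative only: nearest-rank percentiles over a small in-memory sample set.
final class LatencyPercentiles {

    /** Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it. */
    static long percentile(long[] samples, double pct) {
        if (samples.length == 0 || pct < 0.0 || pct > 100.0) {
            throw new IllegalArgumentException("need samples and 0 <= pct <= 100");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical data: 999 samples at 60 usec plus one 18,000 usec stall.
        long[] usec = new long[1000];
        Arrays.fill(usec, 60L);
        usec[500] = 18_000L;
        System.out.println("99%'ile (usec): " + percentile(usec, 99.0));
        System.out.println("max (usec):     " + percentile(usec, 100.0));
    }
}
```

A single stall in a thousand samples leaves the 99%'ile untouched, which is why the max and the high percentiles have to be reported separately.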

One way to deal with Stop-The-World GC
I cannot see it, so it cannot see me.

More Stop-The-World GC avoidance
Time for a bigger rug.

What do actual low latency developers do about it?
They use "Java" instead of Java
  They write in the Java syntax
  They avoid allocation as much as possible
  E.g. they build their own object pools for everything
  They write all the code they use (no 3rd party libs)
  They train developers for their local discipline
In short: they revert to many of the practices that hurt productivity. They lose out on much of Java.
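For readers who have not seen the allocation-avoidance style, here is a minimal sketch of the kind of hand-rolled object pool the slide refers to. It is a hypothetical, single-threaded illustration (the `SimplePool` name and its methods are invented here), not code from any particular trading system.

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Hypothetical single-threaded object pool; all names are illustrative.
final class SimplePool<T> {
    private final ArrayDeque<T> free = new ArrayDeque<>();
    private final Supplier<T> factory;

    SimplePool(Supplier<T> factory, int preallocate) {
        this.factory = factory;
        for (int i = 0; i < preallocate; i++) {
            free.push(factory.get()); // allocate up front, outside the latency-critical path
        }
    }

    /** Reuses a pooled instance if one is free; otherwise falls back to allocating. */
    T acquire() {
        T obj = free.poll();
        return obj != null ? obj : factory.get();
    }

    /** The caller must reset the object's state and must not touch it after this call. */
    void release(T obj) {
        free.push(obj);
    }

    int available() {
        return free.size();
    }

    public static void main(String[] args) {
        SimplePool<StringBuilder> pool = new SimplePool<>(StringBuilder::new, 4);
        StringBuilder sb = pool.acquire();
        sb.append("order");
        sb.setLength(0);   // manual reset before handing the object back
        pool.release(sb);
        System.out.println("free objects: " + pool.available());
    }
}
```

The acquire/reset/release discipline is exactly the manual lifetime management that garbage collection normally takes off the developer's hands.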

Another way to cope: Creative Language
Drawn from evil vendor marketing literature
"Guarantee a worst case of 5 msec, 99% of the time"
  Translation: 1% will be far worse than worst case
"Mostly Concurrent", "Mostly Incremental"
  Translation: Will at times exhibit long monolithic stop-the-world pauses
"Fairly Consistent"
  Translation: Will sometimes show results well outside this range
"Typical pauses in the tens of milliseconds"
  Translation: Some pauses are much longer than tens of milliseconds

What do low latency (Java) developers get for all their effort?
They still see pauses (usually ranging to tens of msec)
But they get fewer (as in less frequent) pauses
And they see fewer people able to do the job
And they have to write EVERYTHING themselves
And they get to debug malloc/free patterns again
And they can only use memory in certain ways...
Some call it fun... Others call it duct tape engineering...

There is a fundamental problem:
Stop-The-World GC mechanisms are contradictory to the fundamental requirements of low latency & low jitter apps

Sustainable Throughput
The throughput achieved while safely maintaining service levels
Unsustainable Throughput
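One way to read this definition operationally: run the system at several load levels, record the worst latency observed at each, and take the highest load that still meets the service level. The sketch below assumes that measurement approach; the class name, method name, and numbers are hypothetical.

```java
// Illustrative only: picks the highest tested load that still met the service level.
final class SustainableThroughput {

    /**
     * Returns the highest tested throughput whose observed worst-case latency
     * stays within the allowed limit, or -1 if no tested load qualifies.
     */
    static double highestSustainable(double[] throughputs, double[] worstLatencyMs, double maxAllowedMs) {
        if (throughputs.length != worstLatencyMs.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double best = -1.0;
        for (int i = 0; i < throughputs.length; i++) {
            if (worstLatencyMs[i] <= maxAllowedMs && throughputs[i] > best) {
                best = throughputs[i];
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] txPerSec = {1000, 2000, 4000}; // hypothetical load levels
        double[] worstMs = {2, 8, 250};         // hypothetical worst latency at each level
        System.out.println("sustainable: " + highestSustainable(txPerSec, worstMs, 10.0) + " tx/s");
    }
}
```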

It's an industry-wide problem

It was an industry-wide problem
It's 2015... Now we have Zing.

The common GC behavior across ALL currently shipping (non-Zing) JVMs
ALL use a Monolithic Stop-the-world NewGen
  small periodic pauses (small as in 10s of msec)
  pauses more frequent with higher throughput or allocation rates
Development focus for ALL is on OldGen collectors
  Focus is on trying to address the many-second pause problem
  Usually by sweeping it farther and farther under the rug
  "Mostly X" (e.g. "mostly concurrent") hides the fact that they refer only to the OldGen part of the collector
  E.g. CMS, G1, Balanced... all are OldGen-only efforts
ALL use a Fallback to Full Stop-the-world Collection
  Used to recover when other mechanisms (inevitably) fail
  Also hidden under the term "Mostly"...

At Azul, STW-GC was addressed head-on
Trivia: Azul as a company was founded predominantly around this one problem, which plagued Java servers at the time
We decided to focus on the right core problems
  Scale & productivity being limited by responsiveness
  Even short GC pauses are considered a problem
Responsiveness must be unlinked from key metrics:
  Transaction Rate, Concurrent users, Data set size, etc.
  Heap size, Live Set size, Allocation rate, Mutation rate
Responsiveness must be continually sustainable
  Can't ignore rare but periodic events
Eliminate ALL Stop-The-World Fallbacks
  Any STW fallback is a real-world failure

The Zing C4 Collector
Continuously Concurrent Compacting Collector
Concurrent, compacting old generation
Concurrent, compacting new generation
No stop-the-world fallback
  Always compacts, and always does so concurrently

Benefits

Stay Responsive
Even when traffic patterns change without warning
7x Load Increase
30 minute span shows elevated load long after event, yet no pauses.

Handle Real World traffic patterns
One second view of transactions.
Not constant. Not random either. Bursty is normal.
Red line shows where order pricing arrival rate would be if constant

Achieve Measurable Benefits
From joint LMAX/Azul talk at QCon London, March 2015
Zing helped LMAX tame GC-related latency outlier pauses
  Highly-engineered system: 4ms every 30 seconds down to 1ms every 2 hours
  Less well-tuned system: 50ms every 30 seconds down to 3ms every 15 minutes
No more unexpected/unwanted old-gen pauses caused by external behavior
  CMS STW intra-day, generally ~500ms, gone
  Removed source of backpressure on latency critical path. Pre-Azul these would occur less predictably, but multiple times a week.

This is not just Theory
jHiccup
A tool that measures and reports (as your application is running) if your JVM is running all the time
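The measurement idea can be sketched in a few lines: a thread that does nothing but sleep for a short, fixed interval should wake up on time, and any extra delay means the runtime (or the OS) was not letting it run. The code below is a minimal sketch in that spirit, assuming a 1 ms sleep interval; it is not jHiccup's implementation, and the class and method names are hypothetical. The real tool reports far more detail than a single worst-case number.

```java
// Hypothetical hiccup meter: measures how much longer than requested a 1 ms sleep takes.
final class HiccupSketch {

    /** Sleeps in 1 ms steps for roughly the given duration; returns the worst oversleep in nanoseconds. */
    static long worstHiccupNanos(long durationMillis) {
        final long intervalNanos = 1_000_000L; // nominal 1 ms sleep
        final long deadline = System.nanoTime() + durationMillis * 1_000_000L;
        long worst = 0L;
        while (deadline - System.nanoTime() > 0) {
            long start = System.nanoTime();
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop measuring
                break;
            }
            long excess = (System.nanoTime() - start) - intervalNanos;
            worst = Math.max(worst, excess);
        }
        return worst;
    }

    public static void main(String[] args) {
        long worst = worstHiccupNanos(5_000);
        System.out.println("worst observed hiccup (usec): " + worst / 1_000);
    }
}
```

In practice such a loop would run on its own thread alongside the application. The reported value also includes ordinary timer and scheduler overshoot, so it is best compared against a baseline taken on an idle system.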

Discontinuities in Java execution - Easy To Measure
We call these "hiccups"
A telco App with a bit of a problem

[Charts: Oracle HotSpot (pure newgen) vs. Zing. Low latency trading application]


[Charts: Oracle HotSpot (pure newgen) vs. Zing. Low latency - Drawn to scale]

It's not just for Low Latency
Just as easy to demonstrate for human-response-time apps

[Charts: Oracle HotSpot CMS, 1GB in an 8GB heap vs. Zing, 1GB in an 8GB heap. Portal Application, slow Ehcache churn]


[Charts: Oracle HotSpot CMS, 1GB in an 8GB heap vs. Zing, 1GB in an 8GB heap. Portal Application - Drawn to scale]

A Recent E-Commerce Case Study

Cyber Monday comes earlier every year
General trends of real world e-commerce traffic

Human-Time Real World Latency Case
Specific e-tail customer based in Salt Lake City, Utah.
Web retail site faces spike loads every year over Thanksgiving through Cyber Monday.
Site latency suffers at peak viewing and buying times, discouraging shoppers and leaving abandoned carts.
Hard to predict height of surge, just know it's big, far higher than regular traffic 362 other days of the year.
New features like gallery search (Solr/Lucene) added higher memory footprint, longer GC times.
Staff spent lots of effort tuning HotSpot.

Real World Latency Results
Timeframe was fall 2014.
Customer studied Azul, met at Strata, NYC
Discussion led to Zing as viable alternative
Customer ran pilot tests with positive results. Needed one Linux adjustment, otherwise same server gear.
POC on customer live system showed better than expected latency profiles. No more GC tuning!
Experienced a stable and profitable Thanksgiving 2014 weekend.

Remind me how GC tuning sucks

Java GC tuning is hard
Examples of actual command line GC tuning terms:
java -Xmx12g -XX:MaxPermSize=64M -XX:PermSize=32M -XX:MaxNewSize=2g -XX:NewSize=1g -XX:SurvivorRatio=128 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:MaxTenuringThreshold=0 -XX:CMSInitiatingOccupancyFraction=60 -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:ParallelGCThreads=12 -XX:LargePageSizeInBytes=256m
java -Xms8g -Xmx8g -Xmn2g -XX:PermSize=64M -XX:MaxPermSize=256M -XX:-OmitStackTraceInFastThrow -XX:SurvivorRatio=2 -XX:-UseAdaptiveSizePolicy -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSParallelRemarkEnabled -XX:+CMSParallelSurvivorRemarkEnabled -XX:CMSMaxAbortablePrecleanTime=10000 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=63 -XX:+UseParNewGC -Xnoclassgc
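Whatever flags are in play, it helps to check what the collectors actually did before and after a change. A minimal sketch using the standard `java.lang.management` API follows; the `GcStats` class and its method are hypothetical names introduced here. Note that `getCollectionTime()` reports accumulated elapsed collection time, which for concurrent collectors is not the same thing as application pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Illustrative only: dumps per-collector counts and accumulated collection time.
final class GcStats {

    /** Sums the accumulated collection time (msec) over all collectors that report one. */
    static long totalCollectionMillis() {
        long total = 0L;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("total collection time (ms): " + totalCollectionMillis());
    }
}
```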

A few GC tuning flags
Source: Word Cloud created by Frank Pavageau in his Devoxx FR 2012 presentation titled "Death by Pauses"

Complete guide to Zing GC tuning
java -Xmx40g

Any other problems beyond GC?

JVMs make many tradeoffs, often trading speed vs. outliers
Some speed techniques come at extreme outlier costs
  E.g. ("regular") biased locking
  E.g. counted loops optimizations
Deoptimization
Lock deflation
Weak References, Soft References, Finalizers
Time To Safe Point (TTSP)

Time To Safepoint: Your new #1 enemy
(Once GC itself was taken care of)
Many things in a JVM (still) use a global safepoint
All threads brought to a halt, at a "safe to analyze" point in code, and then released after work is done.
E.g. GC phase shifts, Deoptimization, Class unloading, Thread Dumps, Lock Deflation, etc. etc.
A single thread with a long time-to-safepoint path can cause an effective pause for all other threads. Consider this a variation on Amdahl's law.
Many code paths in the JVM are long...
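The "one slow thread stalls everyone" effect can be expressed as a small model: the safepoint operation cannot begin until the last thread has arrived, so the pause experienced by the application is bounded below by the slowest thread's time-to-safepoint plus the work done at the safepoint. The sketch below encodes that simplified model; the class name, method name, and figures are hypothetical.

```java
// Simplified model (an assumption): the safepoint operation starts only after the slowest thread arrives.
final class SafepointPauseModel {

    /** Effective pause = longest time-to-safepoint among all threads + time spent in the operation. */
    static long effectivePauseMicros(long[] timeToSafepointMicros, long operationMicros) {
        long slowest = 0L;
        for (long t : timeToSafepointMicros) {
            slowest = Math.max(slowest, t);
        }
        return slowest + operationMicros;
    }

    public static void main(String[] args) {
        long[] ttsp = {5, 20, 3_000}; // hypothetical: one thread stuck in a long code path
        System.out.println("effective pause (usec): " + effectivePauseMicros(ttsp, 100));
    }
}
```

In this model the two quick threads gain nothing from arriving early; they simply wait longer for the straggler.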

Time To Safepoint (TTSP), the most common examples
Array copies and object clone()
Counted loops
Many other variants in the runtime...
Measure, Measure, Measure...
Zing has a built-in TTSP profiler
At Azul, the CTO walks around with a 0.5msec beat down stick...
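As an illustration of the counted-loop case, the two methods below compute the same sum. The commonly described behavior, stated here as an assumption about HotSpot-style JIT compilers rather than a guarantee, is that an int-indexed counted loop may be compiled with no safepoint poll in its body, while a long-indexed loop typically keeps one. Actual behavior depends on the JVM, its version, and its flags, so measure rather than assume. The class and method names are hypothetical.

```java
// Illustrative only: same result, different loop shapes. Safepoint behavior is JVM-dependent.
final class CountedLoopSketch {

    /** int-indexed counted loop: a JIT may omit safepoint polls inside it (assumption, varies by JVM). */
    static long sumIntIndexed(int[] data) {
        long sum = 0L;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    /** long-indexed loop: often not treated as a counted loop, so a poll is typically kept (assumption). */
    static long sumLongIndexed(int[] data) {
        long sum = 0L;
        for (long i = 0; i < data.length; i++) {
            sum += data[(int) i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[10_000_000];
        java.util.Arrays.fill(data, 1);
        System.out.println(sumIntIndexed(data) + " " + sumLongIndexed(data));
    }
}
```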

OS related stuff
(Once GC and TTSP are taken care of)
OS related hiccups tend to dominate once GC and TTSP are removed as issues.
Take scheduling pressure seriously (Duh?)
Hyper-threading (good? bad?)
Swapping (Duh!)
Power management
Transparent Huge Pages (THP)
...
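Checking the Transparent Huge Pages setting is a quick first step on Linux. The sketch below reads the usual sysfs file, `/sys/kernel/mm/transparent_hugepage/enabled`, whose content lists the choices with the active one in square brackets. The `ThpCheck` class and its method are hypothetical names, and the code simply reports that the setting is unavailable on systems where the file does not exist.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: reports the active Transparent Huge Pages mode on Linux.
final class ThpCheck {

    /** Extracts the bracketed (active) choice from a line such as "always madvise [never]". */
    static String activeSetting(String line) {
        int open = line.indexOf('[');
        int close = line.indexOf(']', open + 1);
        if (open < 0 || close < 0) {
            return "unknown";
        }
        return line.substring(open + 1, close);
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get("/sys/kernel/mm/transparent_hugepage/enabled");
        if (Files.isReadable(path)) {
            String line = new String(Files.readAllBytes(path)).trim();
            System.out.println("THP: " + activeSetting(line));
        } else {
            System.out.println("THP setting not available on this system");
        }
    }
}
```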

Takeaway: In 2015, "Real" Java is finally viable for low latency applications
GC is no longer a dominant issue, even for outliers
  2-3 msec worst case with easy tuning
  < 1 msec worst case is very doable
No need to code in special ways any more
  You can finally use "real" Java for everything
  You can finally use 3rd party libraries without worries
  You can finally use as much memory as you want
  You can finally use regular (good) programmers

One-liner Takeaway:
Zing: the cure for your Java hiccups

Compulsory Marketing Pitch

Azul Hot Topics
Zing
  15.05 imminent
  1TB heap
  ReadyNow!
  JMX
  Oracle Linux
Zing for Big Data
  Cloudera CDH5 cert
  Cassandra paper
  Spark is in Zing open source program
Zing for Cloud
  Amazon AMIs
  Rackspace OnMetal compat
  Docker in R&D
Zulu
  Azure Gallery
  JSE Embedded
  8u45 in the chute

Q&A and In Closing
Go get some Zing today!
At very least download jHiccup.
Grab a Zing Free Trial card.
Let's talk about best BBQ in Texas
@schuetzematt
azul.com