Low level Java programming With examples from OpenHFT

Similar documents

High Frequency Trading and NoSQL. Peter Lawrey CEO, Principal Consultant Higher Frequency Trading

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra

JBoss Data Grid Performance Study Comparing Java HotSpot to Azul Zing

Tomcat Tuning. Mark Thomas April 2009

Practical Performance Understanding the Performance of Your Application

Delivering Quality in Software Performance and Scalability Testing

MEASURING WORKLOAD PERFORMANCE IS THE INFRASTRUCTURE A PROBLEM?

How NOT to Measure Latency

Java Performance. Adrian Dozsa TM-JUG

Understanding Hardware Transactional Memory

Cloud Operating Systems for Servers

Cognos8 Deployment Best Practices for Performance/Scalability. Barnaby Cole Practice Lead, Technical Services

Datacenter Operating Systems

Methods to achieve low latency and consistent performance

MID-TIER DEPLOYMENT KB

Introduction 1 Performance on Hosted Server 1. Benchmarks 2. System Requirements 7 Load Balancing 7

Implementing a Well- Performing and Reliable Portal

Apache Tomcat Tuning for Production

Liferay Portal Performance. Benchmark Study of Liferay Portal Enterprise Edition

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat

Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details

Oracle Corporation Proprietary and Confidential

Realtime Linux Kernel Features

Agenda. Tomcat Versions Troubleshooting management Tomcat Connectors HTTP Protocal and Performance Log Tuning JVM Tuning Load balancing Tomcat

Adobe LiveCycle Data Services 3 Performance Brief

An Oracle White Paper July Oracle Primavera Contract Management, Business Intelligence Publisher Edition-Sizing Guide

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Contents Introduction... 5 Deployment Considerations... 9 Deployment Architectures... 11

Eloquence Training What s new in Eloquence B.08.00

Holly Cummins IBM Hursley Labs. Java performance not so scary after all

Apache and Tomcat Clustering Configuration Table of Contents

Best Practices for Deploying Citrix XenDesktop on NexentaStor Open Storage

Java Garbage Collection Basics

Top 10 Performance Tips for OBI-EE

Ground up Introduction to In-Memory Data (Grids)

A Close Look at PCI Express SSDs. Shirish Jamthe Director of System Engineering Virident Systems, Inc. August 2011

Zing Vision. Answering your toughest production Java performance questions

BigMemory: Providing competitive advantage through in-memory data management

Java Performance Tuning

Validating Java for Safety-Critical Applications

Java DB Performance. Olav Sandstå Sun Microsystems, Trondheim, Norway Submission ID: 860

Multi-core and Linux* Kernel

WebSphere Performance Monitoring & Tuning For Webtop Version 5.3 on WebSphere 5.1.x

Garbage Collection in the Java HotSpot Virtual Machine

Efficient Architectures for Low Latency and High Throughput Trading Systems on the JVM

J2EE-JAVA SYSTEM MONITORING (Wily introscope)

CS423 Spring 2015 MP4: Dynamic Load Balancer Due April 27 th at 9:00 am 2015

How To Use Java On An Ipa (Jspa) With A Microsoft Powerbook (Jempa) With An Ipad And A Microos 2.5 (Microos)

A Comparative Study on Vega-HTTP & Popular Open-source Web-servers

Petascale Software Challenges. Piyush Chaudhary High Performance Computing

Configuring Apache Derby for Performance and Durability Olav Sandstå

Analysis of VDI Storage Performance During Bootstorm

Boost SQL Server Performance Buffer Pool Extensions & Delayed Durability

Agility Database Scalability Testing

Resource Utilization of Middleware Components in Embedded Systems

Scaling Progress OpenEdge Appservers. Syed Irfan Pasha Principal QA Engineer Progress Software

Informatica Master Data Management Multi Domain Hub API: Performance and Scalability Diagnostics Checklist

Developing Scalable Java Applications with Cacheonix

Tuning Your GlassFish Performance Tips. Deep Singh Enterprise Java Performance Team Sun Microsystems, Inc.

Scaling Analysis Services in the Cloud

Kernel Optimizations for KVM. Rik van Riel Senior Software Engineer, Red Hat June

IT Best Practices Audit TCS offers a wide range of IT Best Practices Audit content covering 15 subjects and over 2200 topics, including:

How To Store Data On An Ocora Nosql Database On A Flash Memory Device On A Microsoft Flash Memory 2 (Iomemory)

JBoss Seam Performance and Scalability on Dell PowerEdge 1855 Blade Servers

Using jvmstat and visualgc to Solve Memory Management Problems

An Oracle White Paper March Load Testing Best Practices for Oracle E- Business Suite using Oracle Application Testing Suite

Performance Optimization For Operational Risk Management Application On Azure Platform

BigMemory & Hybris : Working together to improve the e-commerce customer experience

I3: Maximizing Packet Capture Performance. Andrew Brown

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle

PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0

Hadoop Size does Hadoop Summit 2013

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

Benchmarking Couchbase Server for Interactive Applications. By Alexey Diomin and Kirill Grigorchuk

bla bla OPEN-XCHANGE Open-Xchange Hardware Needs

CloudCmp:Comparing Cloud Providers. Raja Abhinay Moparthi

Apache Derby Performance. Olav Sandstå, Dyre Tjeldvoll, Knut Anders Hatlen Database Technology Group Sun Microsystems

Oracle WebLogic Thread Pool Tuning

farmerswife Contents Hourline Display Lists 1.1 Server Application 1.2 Client Application farmerswife.com

Multi-core Programming System Overview

In Memory Accelerator for MongoDB

EWeb: Highly Scalable Client Transparent Fault Tolerant System for Cloud based Web Applications

Multi-Threading Performance on Commodity Multi-Core Processors

Benchmarking Hadoop & HBase on Violin

WebSphere Architect (Performance and Monitoring) 2011 IBM Corporation

Liferay Portal s Document Library: Architectural Overview, Performance and Scalability

Oracle WebLogic Server 11g Administration

This talk is mostly about Data Center Replication, but along the way we'll have to talk about why you'd want transactionality arnd the Low-Level API.

Flash-Friendly File System (F2FS)

Parallel Replication for MySQL in 5 Minutes or Less

System Requirements Table of contents

3 Examples of Reliability Testing. Dan Downing, VP Testing Services MENTORA GROUP

Microkernels, virtualization, exokernels. Tutorial 1 CSC469

Transcription:

Low level Java programming With examples from OpenHFT Peter Lawrey CEO and Principal Consultant Higher Frequency Trading. Presentation to Joker 2014 St Petersburg, October 2014.

About Us Higher Frequency Trading is a small consulting and software development house specialising in: Low latency, high throughput software 6 developers + 3 staff in Europe and USA. Sponsor HFT related open source projects Core Java engineering

About Me CEO and Principal Consultant Third on Stackoverflow for Java, most Java Performance answers. Vanilla Java blog with 3 million views Founder of the Performance Java User's Group An Australian, based in the U.K.

What Is The Problem We Solve? "I want to be able to read and write my data to a persisted, distributed system, with the speed of in memory data structures"

Agenda What are our Open Source products used for? Where to start with low latency? When you know you have a problem, what can you do?

Chronicle scaling Vertically and Horizontally Shares data structure between processes Replication between machines Build on a low level library Java Lang. Millions of operations per second. Micro-second latency. No TCP locally. Synchronous logging to the OS. Apache 2.0 available on GitHub Persisted via the OS.

What is Chronicle Map/Set? Low latency persisted key-value store. Latency between processes around 200 ns. In specialized cases, latencies < 25 ns. Throughputs up to 30 million/second.

What is Chronicle Map/Set? ConcurrentMap or Set interface Designed for reactive programming Replication via TCP and UDP. Apache 2.0 open source library. Pure Java, supported on Windows, Linux, Mac OSX.

What is Chronicle Queue? Low latency journaling and logging. Low latency cross JVM communication. Designed for reactive programming Throughputs up to 40 million/second.

What is Chronicle Queue? Latencies between processes of 200 nano-seconds. Sustain rates of 400 MB/s, peaks much higher. Replication via TCP. Apache 2.0 open source library. Pure Java, supported on Windows, Linux, Mac OSX.

Chronicle monitoring a legacy application

Chronicle journalling multiple applications

Chronicle for low latency trading

Short demo using OSResizesMain Note: The VIRT virtual memory size is 125t for 125 TB, actual usage 97M System memory: 7.7 GB, Extents of map: 137439.0 GB, disk used: 97MB, addressrange: 233a777f000-7f33a8000000 $ ls -lh /tmp/oversized* -rw-rw-r-- 1 peter peter 126T Oct 20 17:03 /tmp/over-sized... $ du -h /tmp/oversized* 97M /tmp/over-sized...

Where to start with low latency?

You need to measure first. What is the end to end use case you need to improve? Is it throughput or latency you need to improve? Throughput or average latency hides poor latencies. Avoid co-ordinated omission. See Gil Tene's talks.

Looking at the latency percentiles

Tools to help you measure A commercial profiler. e.g. YourKit. Instrument timings in production. Record and replay production loads. Avoid co-ordinated omission. See Gil Tene's talks. If you can't change the code, Censum can help you tune your GC pause times. Azul's Zing solves GC pause times, but has many other tools to reduce jitter.

What to look for when profiling Reducing the allocation rate is often a quick win. Memory profile to reduce garbage produced. When CPU profiling, leave the memory profiler on. If the profiler is no long helpful, application instrumentation can take it to the next level.

When you know you have a problem, what can you do about it?

Is garbage unavoidable? You can always reduce it further and further, but at some point it's not worth it. For a web service, 500 MB/s might be ok. For a trading system, 500 KB/s might be ok. If you produce 250 KB/s it will take you 24 hours to fill a 24 GB Eden space.

Common things to tune. A common source of garbage is Iterators. for (String s : arraylist) { } Creates an Iterator, however for (int i = 0, len = arraylist.size(); i < len; i++) { String s = arraylist.get(i); } Doesn't create an Iterator.

Common things to tune. BigDecimal can be a massive source of garbage. BigDecimal a = b.add(c).divide(bigdecimal.two, 2, ROUND_HALF_UP); The same as double produces no garbage. double a = round(b + c, 2); You have to have a library to support rounding. Without it you will get rounding and representation errors.

Be aware of your memory speeds. concurrency Clock cycles Seconds L1 caches multi-core 4 1 seconds L2 cache multi-core 10 3 seconds L3 cache socket wide 40-75 15 seconds Main memory System wide 200 50 seconds. SSD access System wide 50,000 14 hours Local Network Network 180,000 2 days HDD System wide 30,000,000 1 year. To maximise performance you want to spend as much time in L3, or ideally L1/L2 caches as possible.

Memory access is faster with less garbage Reducing garbage minimises filling your caches with garbage. If you are producing 300 MB/s of garbage your L1 cache will be filled with garbage is about 100 micro-seconds, your L2 cache will be filled in under 1 milli-second. The L3 cache and main memory shared and the more you use this, the less scalability you will get from your multi-cores.

Faster memory access Reduce garbage Reduce pointer chasing Use primitives instead of objects. Avoid false sharing for highly contended mutated values

Lock free coding AtomicLong.addAndGet(n) Unsafe.compareAndSwapLong Unsafe.getAndAddLong

Using off heap memory Reduce GC work load. More control over layout Data can be persisted Data can be embedded into multiple processes. Can exploit virtual memory allocation instead of main memory allocation.

Low latency network connections Kernel by pass network cards e.g. Solarflare Single hop latencies around 5 micro-seconds Make scaling to more machines practical when tens of micro-seconds matter.

Reducing micro-jitter Unless you isolate a CPU, it will get interrupted by the scheduler a lot. You can see delays of 1 5 ms every hour on an otherwise idle machine. On a virtualised machine, you can see delays of 50 ms. Java Thread Affinity library can declaratively layout your critical threads.

Reducing micro-jitter Number of interrupts per hour by length.

Q&A http://openhft.net/ Performance Java User's Group. @PeterLawrey peter.lawrey@higherfrequencytrading.com