Java in High-Performance Computing



Java in High-Performance Computing Dawid Weiss Carrot Search Institute of Computing Science, Poznan University of Technology GeeCon Poznań, 05/2010

Learn from the mistakes of others. You can't live long enough to make them all yourself. (Eleanor Roosevelt)

Talk outline What is High performance? What is Java? Measuring performance (benchmarking). HPPC library. Crosscutting: (un?)common pitfalls and performance killers. Some HotSpot internals.

Divide-and-conquer style algorithm

    for (Example e : examples) {
        if (e.hasQuiz()) e.showQuiz(); else e.showCode();
        e.explain();
        e.deriveConclusions();
    }

PART I High Performance Computing

High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems. Wikipedia

Is Java faster than C/C++? The short answer is: it depends. Cliff Click

It's usually hard to make a fast program run faster. It's easy to make a slow program run even slower. It's easy to make fast hardware run slow.

For now, HPC means limited allowed computation time and constrained resources (hardware, memory). Good HPC software means no (obvious) flaws.

PART II What is Java? (Recall: Is Java faster than C/C++?)

Example 1

    public void testSum1() {
        int sum = 0;
        for (int i = 0; i < COUNT; i++)
            sum += sum1(i, i);
        result = sum;
    }

    public void testSum2() {
        int sum = 0;
        for (int i = 0; i < COUNT; i++)
            sum += sum2(i, i);
        result = sum;
    }

where the bodies of sum1 and sum2 sum their arguments and return the result, and COUNT is significantly large...

    VM                 sum1    sum2    sum3    sum4
    sun-1.6.0-20       0.04    2.62    1.05    3.76
    sun-1.6.0-16       0.04    3.20    1.39    4.99
    sun-1.5.0-18       0.04    3.29    1.46    5.20
    ibm-1.6.2          0.08    6.28    0.16   14.64
    jrockit-27.5.0     0.18    0.16    1.16    3.18
    harmony-r917296    0.17    0.35    9.18   22.49

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).

    int sum1(int a, int b) {
        return a + b;
    }

    Integer sum2(Integer a, Integer b) {
        return a + b;
    }

    // sum2 after the compiler removes the autoboxing sugar:
    Integer sum2(Integer a, Integer b) {
        return Integer.valueOf(a.intValue() + b.intValue());
    }

    int sum3(int... args) {
        int sum = 0;
        for (int i = 0; i < args.length; i++)
            sum += args[i];
        return sum;
    }

    Integer sum4(Integer... args) {
        int sum = 0;
        for (int i = 0; i < args.length; i++) {
            sum += args[i];
        }
        return sum;
    }

    // sum4 after the compiler removes the varargs sugar:
    Integer sum4(Integer[] args) {
        //...
    }

Conclusions Syntactic sugar may be costly. Primitive types are fast. Large differences between different VMs.

Example 2 Write once, run anywhere!

But it's the same VM!

It works on my machine!

    private static boolean ready;

    public static void startThread() {
        new Thread() {
            public void run() {
                try {
                    sleep(2000);
                } catch (Exception e) {
                    /* ignore */
                }
                System.out.println("Marking loop exit.");
                ready = true;
            }
        }.start();
    }

    public static void main(String[] args) {
        startThread();
        System.out.println("Entering the loop...");
        while (!ready) {
            // Do nothing.
        }
        System.out.println("Done, I left the loop!");
    }

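Are these two loops equivalent?

    while (!ready) {
        // Do nothing.
    }

    boolean r = ready;
    while (!r) {
        // Do nothing.
    }

In most cases yes, from a JMM perspective: ready is not volatile, so the compiler is allowed to read it once and never look at it again.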
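One common way to make the loop terminate reliably is to declare the flag volatile, which forbids hoisting the read out of the loop. The following is a minimal sketch under that assumption; the class name, the shortened sleep, and the waitForReady timeout helper are illustrative and not part of the talk's code.

```java
// Hypothetical sketch; VolatileFlagSketch and waitForReady are illustrative names.
final class VolatileFlagSketch {
    // volatile: every read in the loop must observe the latest write from other threads.
    private static volatile boolean ready;

    static void startThread() {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            System.out.println("Marking loop exit.");
            ready = true;
        });
        t.setDaemon(true);
        t.start();
    }

    // Spins until the flag is set; returns false if the timeout elapses first.
    static boolean waitForReady(long timeoutMillis) {
        final long start = System.nanoTime();
        final long timeoutNanos = timeoutMillis * 1_000_000L;
        while (!ready) {
            if (System.nanoTime() - start > timeoutNanos) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        startThread();
        System.out.println("Entering the loop...");
        boolean left = waitForReady(5_000);
        System.out.println(left ? "Done, I left the loop!" : "Timed out.");
    }
}
```

The timeout exists only so the sketch cannot hang; the volatile modifier is the part that addresses the visibility problem.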

JVM Internals...

C1 (client compiler): fast, not (much) optimization. C2 (server compiler): slow(er) than C1, a lot of JMM-allowed optimizations.

There are hundreds of JVM tuning/diagnostic switches.

My personal favorite:

Conclusions Bytecode is far from what is executed. A lot going on under the (VM) hood. Bad code may work, but will eventually crash. HotSpot-level optimizations are good. If there is a bug in the HotSpot compiler...

Any other diversifying factors?

J2ME: more VM vendors, hardware diversity, software and hardware quirks.

Non-JVM target platforms: Dalvik, GWT, IKVM.

Conclusions There is no single Java performance model. Performance depends on the VM, environment, class library, hardware. Apply benchmark-and-correct cycle.

Benchmarking

Example 3

    public void testSum1() {
        int sum = 0;
        for (int i = 0; i < COUNT; i++)
            sum += sum1(i, i);
        result = sum;
    }

    public void testSum1_2() {
        int sum = 0;
        for (int i = 0; i < COUNT; i++)
            sum += sum1(i, i);
    }

    VM                 sum1    sum1_2
    sun-1.6.0-20       0.04    0.00
    sun-1.6.0-16       0.04    0.00
    sun-1.5.0-18       0.04    0.00
    ibm-1.6.2          0.08    0.01
    jrockit-27.5.0     0.17    0.08
    harmony-r917296    0.17    0.11

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).

java -server -XX:+PrintOptoAssembly -XX:+PrintCompilation...

    - method holder: com/dawidweiss/geecon2010/Example03
    - access: 0xc1000001 public
    - name: testSum1_2
    ...
    010   pushq rbp
          subq  rsp, #16    # Create frame
          nop               # nop for patch_verified_entry
    016   addq  rsp, 16     # Destroy frame
          popq  rbp
          testl rax, [rip + #offset_to_poll_page]    # Safepoint: poll for GC
    021   ret

Conclusions Benchmarks must be executed to provide feedback. HotSpot is smart and effective at removing dead code.

Example 4

    @Test
    public void testAdd1() {
        int sum = 0;
        for (int i = 0; i < COUNT; i++) {
            sum += add1(i);
        }
        guard = sum;
    }

    public int add1(int i) {
        return i + 1;
    }

Note: add1 is virtual.

    switch                                testAdd1
    -XX:+Inlining -XX:+PrintInlining      0.04
    -XX:-Inlining                         0.45

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200, JRE 1.7b80-debug).

Most Java calls are monomorphic.

HotSpot adjusts to megamorphic calls automatically.

Example 5

    abstract class Superclass {
        abstract int call();
    }

    class Sub1 extends Superclass { int call() { return 1; } }
    class Sub2 extends Superclass { int call() { return 2; } }
    class Sub3 extends Superclass { int call() { return 3; } }

    Superclass[] mixed = initWithRandomInstances(10000);
    Superclass[] solid = initWithSub1Instances(10000);

    @Test
    public void testMonomorphic() {
        int sum = 0;
        int m = solid.length;
        for (int i = 0; i < COUNT; i++)
            sum += solid[i % m].call();
        guard = sum;
    }

    @Test
    public void testMegamorphic() {
        int sum = 0;
        int m = mixed.length;
        for (int i = 0; i < COUNT; i++)
            sum += mixed[i % m].call();
        guard = sum;
    }

    VM                 monomorphic    megamorphic
    sun-1.6.0-20       0.19           0.32
    sun-1.6.0-16       0.19           0.34
    sun-1.5.0-18       0.18           0.34
    ibm-1.6.2          0.20           0.30
    jrockit-27.5.0     0.22           0.29
    harmony-r917296    0.27           0.32

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).

Example 6

    @Test
    public void testBitCount1() {
        int sum = 0;
        for (int i = 0; i < COUNT; i++)
            sum += Integer.bitCount(i);
        guard = sum;
    }

    @Test
    public void testBitCount2() {
        int sum = 0;
        for (int i = 0; i < COUNT; i++)
            sum += bitCount(i);
        guard = sum;
    }

    /* Copied from
     * {@link Integer#bitCount} */
    static int bitCount(int i) {
        // HD, Figure 5-2
        i = i - ((i >>> 1) & 0x55555555);
        i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);
        i = (i + (i >>> 4)) & 0x0f0f0f0f;
        i = i + (i >>> 8);
        i = i + (i >>> 16);
        return i & 0x3f;
    }

    VM               testBitCount1    testBitCount2
    sun-1.6.0-20     0.43             0.43
    sun-1.7.0-b80    0.43             0.43

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).

    VM               testBitCount1    testBitCount2
    sun-1.6.0-20     0.08             0.33
    sun-1.7.0-b83    0.07             0.32

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Windows 7, Intel I7 860).

... -XX:+PrintInlining...

    ...
    Inlining intrinsic _bitCount_i at bci:9 in..Example06::testBitCount1
    Inlining intrinsic _bitCount_i at bci:9 in..Example06::testBitCount1
    Inlining intrinsic _bitCount_i at bci:9 in..Example06::testBitCount1
    Example06.testBitCount1: [measured 10 out of 15 rounds]
    round: 0.07 [+- 0.00], round.gc: 0.00 [+- 0.00]
    ...
    @ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
    @ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
    @ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
    Example06.testBitCount2: [measured 10 out of 15 rounds]
    round: 0.32 [+- 0.01], round.gc: 0.00 [+- 0.00]
    ...

... -XX:+PrintOptoAssembly...

    {method}
    - klass: {other class}
    - method holder: com/dawidweiss/geecon2010/Example06
    - name: testBitCount1
    ...
    0c2 B13: # B12 B14 <- B8 B12 Loop: B13-B12 inner stride:...
    0c2   movl   R10, RDX           # spill
    ...
    0e1   movl   [rsp + #40], R11   # spill
    0e6   popcnt R8, R8
    ...
    0f5   addl   R9, #7             # int
    0f9   popcnt R11, R11
    0fe   popcnt RCX, R9

Conclusions Benchmarks must be statistically sound (averages, variance, min, max, warm-up phase). Account for HotSpot optimisations. Account for hardware differences (test on target). Use domain data and real scenarios. Inspect suspicious output with a debug JVM. See more: Cliff Click, http://java.sun.com/javaone/2009/articles/rockstar_click.jsp.
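The checklist above can be sketched as a tiny hand-rolled harness: unmeasured warm-up rounds, measured rounds, summary statistics, and a guard field so the measured loop is not eliminated as dead code. This is only an illustration under assumed names (BenchSketch, measure, guard); it is not the junit-benchmarks API, and a real harness should be preferred for actual measurements.

```java
// Hypothetical sketch; BenchSketch and its methods are illustrative names.
final class BenchSketch {
    // Results are written here so the JIT cannot discard the measured loop as dead code.
    static volatile int guard;

    // Runs the task warmupRounds times unmeasured, then returns per-round times in seconds.
    static double[] measure(Runnable task, int warmupRounds, int measuredRounds) {
        for (int i = 0; i < warmupRounds; i++) {
            task.run();
        }
        double[] seconds = new double[measuredRounds];
        for (int i = 0; i < measuredRounds; i++) {
            long start = System.nanoTime();
            task.run();
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Population standard deviation of the samples.
    static double stdDev(double[] xs) {
        double m = mean(xs);
        double squares = 0;
        for (double x : xs) squares += (x - m) * (x - m);
        return Math.sqrt(squares / xs.length);
    }

    static double min(double[] xs) {
        double r = Double.POSITIVE_INFINITY;
        for (double x : xs) r = Math.min(r, x);
        return r;
    }

    static double max(double[] xs) {
        double r = Double.NEGATIVE_INFINITY;
        for (double x : xs) r = Math.max(r, x);
        return r;
    }

    public static void main(String[] args) {
        final int count = 100_000_000;
        double[] t = measure(() -> {
            int sum = 0;
            for (int i = 0; i < count; i++) sum += i + i;
            guard = sum; // publish the result, as the talk's examples do with "result" and "guard"
        }, 5, 10);
        System.out.printf("avg %.3f s, stddev %.3f, min %.3f, max %.3f%n",
                mean(t), stdDev(t), min(t), max(t));
    }
}
```

Reporting min, max and deviation next to the average makes outlier rounds (GC pauses, late compilation) visible instead of hiding them in a single number.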

HPPC High Performance Primitive Collections

Motivation Primitive types: fast and memory-friendly. Optional assertions. Single-threaded. No fail-fast. Fast, fast, fast iterators, with no GC overhead. Open internals (explicit implementation). Programmers know what they're doing.

Why not JCF?

    public interface List<E> extends Collection<E> {
        boolean contains(Object o);    // [-] contract-enforced methods
        Iterator<E> iterator();        // [-] iterators over primitive types?
        Object[] toArray();            // [-] troublesome covariants
        // ...
    }

Friendly Competition fastutil PCJ GNU Trove Apache Mahout (ported COLT) Apache Primitive Collections All of these have pros and cons and deal with JCF compatibility somehow.

Iterators in fastutil or PCJ

    interface IntIterator extends Iterator<Integer> {
        // Primitive-specific method
        int nextInt();
    }

Iterators in HPPC

    public final class IntCursor {
        public int index;
        public int value;
    }

    public class IntArrayList implements Iterable<IntCursor> {
        public Iterator<IntCursor> iterator() {... }
    }

Iterating over list elements in HPPC

    for (IntCursor c : list) {
        System.out.println(c.index + ": " + c.value);
    }

...or

    list.forEach(new IntProcedure() {
        public void apply(int value) {
            System.out.println(value);
        }
    });

...or

    final int[] buffer = list.buffer;
    final int size = list.size();
    for (int i = 0; i < size; i++) {
        System.out.println(i + ": " + buffer[i]);
    }
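To illustrate why a cursor-based iterator avoids boxing and per-element garbage, here is a minimal sketch of the idea: the iterator hands out the same mutable cursor object on every step. IntCursorSketch and IntListSketch are simplified, hypothetical stand-ins written for this illustration; they are not HPPC's actual classes.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a cursor: a mutable (index, value) pair.
final class IntCursorSketch {
    int index;
    int value;
}

// Hypothetical stand-in for a primitive int list with an open backing buffer.
final class IntListSketch implements Iterable<IntCursorSketch> {
    int[] buffer = new int[8];
    private int size;

    void add(int v) {
        if (size == buffer.length) {
            buffer = Arrays.copyOf(buffer, size * 2);
        }
        buffer[size++] = v;
    }

    int size() {
        return size;
    }

    @Override
    public Iterator<IntCursorSketch> iterator() {
        return new Iterator<IntCursorSketch>() {
            // One cursor per iterator, reused for every element: no Integer boxing.
            private final IntCursorSketch cursor = new IntCursorSketch();
            private int next;

            @Override
            public boolean hasNext() {
                return next < size;
            }

            @Override
            public IntCursorSketch next() {
                if (next >= size) {
                    throw new NoSuchElementException();
                }
                cursor.index = next;
                cursor.value = buffer[next++];
                return cursor;
            }
        };
    }

    public static void main(String[] args) {
        IntListSketch list = new IntListSketch();
        list.add(10);
        list.add(20);
        for (IntCursorSketch c : list) {
            System.out.println(c.index + ": " + c.value);
        }
    }
}
```

The trade-off of this design is that a cursor must not be stored across iterations, because its fields are overwritten by the next call to next().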

The fastest one?

What's in HPPC?

Open implementation is good.

    /**
     * Applies a supplemental hash function to a given
     * hashCode, which defends against poor quality
     * hash functions. [...]
     */
    static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

HashMap rehashes your (carefully crafted) hash code.
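A small sketch of what this supplemental hash does in practice: it counts how many distinct buckets a power-of-two table uses for hash codes that are all multiples of 16, with and without the rehash. The class name and the distinctBuckets helper are assumptions made for this illustration; only the hash function itself is taken from the listing above.

```java
// Hypothetical sketch; only hash(int) mirrors the function quoted in the talk.
final class SupplementalHashSketch {
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Counts distinct bucket indices for the hash codes 0, 16, 32, ... (tableSize of them).
    // tableSize must be a power of two, as in HashMap.
    static int distinctBuckets(int tableSize, boolean rehash) {
        boolean[] used = new boolean[tableSize];
        int distinct = 0;
        for (int k = 0; k < tableSize; k++) {
            int h = k * 16;
            int spread = rehash ? hash(h) : h;
            int bucket = spread & (tableSize - 1);
            if (!used[bucket]) {
                used[bucket] = true;
                distinct++;
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        System.out.println("without rehash: " + distinctBuckets(16, false));
        System.out.println("with rehash:    " + distinctBuckets(16, true));
    }
}
```

For a 16-slot table the raw codes all land in one bucket, while the rehashed codes occupy all 16. That helps weak hash functions, but it also means a deliberately designed hash code is scrambled before use, which is the talk's complaint.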

HPPC approach (example):

    public class LongIntOpenHashMap implements LongIntMap {
        //...
        public LongIntOpenHashMap(int initialCapacity, float loadFactor,
                                  LongHashFunction keyHashFunction,
                                  IntHashFunction valueHashFunction) {
            //...
        }
    }

Defaults: LongMurmurHash, IntHashFunction.

Example 7 Frequency count of character bigrams in a given text.

HPPC:

    final char[] CHARS = DATA;
    final IntIntOpenHashMap counts = new IntIntOpenHashMap();
    for (int i = 0; i < CHARS.length - 1; i++) {
        counts.putOrAdd((CHARS[i] << 16) | CHARS[i + 1], 1, 1);
    }

JCF, boxed integer types:

    final Integer currentCount = map.get(bigram);
    map.put(bigram, currentCount == null ? 1 : currentCount + 1);

JCF, with IntHolder (mutable value object).

GNU Trove:

    map.adjustOrPutValue(bigram, 1, 1);

fastutil, OpenHashMap and LinkedOpenHashMap:

    map.put(bigram, map.get(bigram) + 1);

PCJ, OpenHashMap and ChainedHashMap.
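For readers without any of these libraries at hand, the key-packing trick can be tried with plain JCF. The sketch below is the boxed variant (so it pays the boxing cost the talk warns about) and uses hypothetical names (BigramCountSketch, pack, unpack); it only shows how two chars fit into one int key.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of bigram counting with a packed int key (boxed JCF variant).
final class BigramCountSketch {
    // Two 16-bit chars fit into one 32-bit int: first char in the high half.
    static int pack(char first, char second) {
        return (first << 16) | second;
    }

    static String unpack(int key) {
        return "" + (char) (key >>> 16) + (char) (key & 0xFFFF);
    }

    static Map<Integer, Integer> count(char[] chars) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i < chars.length - 1; i++) {
            // merge() inserts 1 for a new key, otherwise adds 1 to the stored count.
            counts.merge(pack(chars[i], chars[i + 1]), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = count("abracadabra".toCharArray());
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            System.out.println(unpack(e.getKey()) + ": " + e.getValue());
        }
    }
}
```

Every key and every updated count here goes through Integer boxing, which is the overhead the primitive maps in the comparison are designed to remove.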

Is Java faster than C/C++? The short answer is: it depends. Cliff Click

Example 8 The same algorithm for building a DFSA automaton accepting a set of strings. Input: 3 565 575 strings, 158M of text.

            gcc -O2     java 1.6.0_20-64
    real    63.850s     43.197s
    user    63.110s     46.370s
    sys      0.240s      0.840s

Summary and Conclusions

Performance checklist (sanity check) Algorithms, algorithms, algorithms. Proper data structures. Spurious GC activity. Memory barriers in tight loops. CPU cache utilization. Low-level, hotspot-specific code structuring.

HPPC and junit-benchmarks are at: http://labs.carrotsearch.com