Distributed Java Virtual Machine




Master Thesis

Distributed Java Virtual Machine

Ken H. Lee

Mathias Payer, Responsible assistant
Prof. Thomas R. Gross

Laboratory for Software Technology
ETH Zurich
August 2008

Abstract

Because Java has built-in support for multi-threading and thread synchronisation, parallel Java applications, which are traditionally executed on a single Java Virtual Machine (JVM), can be constructed easily. With the emergence of cluster computing as a cost-effective substitute for multi-core computers, several approaches have been proposed for building a distributed runtime environment that runs Java applications in parallel. In this thesis we added a distributed runtime system to an open-source JVM, the Jikes Research Virtual Machine, to provide a truly parallel execution environment for multi-threaded Java applications within a cluster of workstations. To achieve transparency, we implemented a virtual global object space that hides the underlying distribution from the programmer and the application. In addition, we added mechanisms for accessing objects in this global memory space, for thread synchronization, distributed classloading, distributed scheduling, and I/O redirection, and we evaluated the overhead of these mechanisms with several microbenchmarks. The result is a prototype of a distributed Java Virtual Machine running as part of the JikesRVM that offers a wide scope for extensions. Because the virtual object heap is embedded into the JVM and the JVM is aware of the cluster, the distributed runtime system benefits from the abundant runtime information available in the JVM, which opens opportunities for further optimizations.

Zusammenfassung

Thanks to its built-in support for multithreading and thread synchronisation, Java applications can be parallelised with little effort; traditionally, however, they run on only a single Java Virtual Machine (JVM). With the emergence of cluster computing, a cost-effective alternative to multiprocessor machines, there have been various approaches to developing a distributed runtime environment that lets Java applications run truly in parallel. In this work we extended the open-source JVM Jikes Research Virtual Machine with a distributed runtime system that allows parallel Java applications to run truly in parallel on a cluster. We implemented a virtual global memory to hide the underlying distribution from the programmer and the application. In addition, we implemented mechanisms for accessing objects in this global address space, for thread synchronization, for distributed classloading and distributed scheduling, and for I/O redirection, and we evaluated the overhead of these mechanisms. With our implementation we created a prototype of a distributed JVM that runs inside a stable JVM and leaves ample room for extensions. Because the virtual global memory is embedded in the JVM and the JVM is aware of the cluster, the distributed runtime system benefits from the JVM's rich runtime information, which enables additional optimizations to improve performance.

Acknowledgements

I would like to take this opportunity to express my gratitude towards the following people: First of all, I want to thank Professor Thomas Gross for being my mentor and giving me the chance to explore an interesting research topic. I would also like to thank my supervisor Mathias Payer for his guidance and advice during our weekly meetings; he provided me with many inspiring ideas for this work. My thanks also go to the Jikes Research Virtual Machine community, especially Ian Rogers, for their prompt help and support. Last but not least, I would like to thank my friends Michael Gubser and Yves Alter for revising the linguistic aspects of this documentation.

Table of Contents

List of Tables
List of Figures
1 Introduction
2 Background and Preliminary Work
  2.1 Jikes Research Virtual Machine
    2.1.1 Baseline and Optimizing Compilers
    2.1.2 Threading Model
    2.1.3 Object Model
  2.2 Distributed Shared Memory
    2.2.1 Page-Based Distributed Shared Memory
    2.2.2 Object-Based Distributed Shared Memory
  2.3 Memory Consistency Model
    2.3.1 Lazy Release Consistency Model
    2.3.2 Java Memory Model
3 Related Work
  3.1 Compiler-Based Distributed Runtime Systems
    3.1.1 Hyperion
  3.2 Systems Using Standard JVMs
    3.2.1 JavaSplit
  3.3 Cluster-Aware JVMs
    3.3.1 Java/DSM
    3.3.2 cjvm
    3.3.3 JESSICA2
    3.3.4 djvm
  3.4 Conclusion
4 Design
  4.1 Overview
  4.2 Home-Based Lazy Release Consistency Model
    4.2.1 Adaption to Our DSM
  4.3 Communication
    4.3.1 Protocol
  4.4 Shared Objects
    4.4.1 Definition
    4.4.2 Globally Unique ID
    4.4.3 Detection
    4.4.4 States
    4.4.5 Conclusion
  4.5 Distributed Classloading
  4.6 Scheduling
    4.6.1 Synchronization
  4.7 I/O Redirection
  4.8 Garbage Collection
5 Implementation
  5.1 Messaging Model
  5.2 Boot Process
  5.3 Shared Object Space
    5.3.1 Shared Objects
    5.3.2 GUID Table
    5.3.3 Faulting Scheme
    5.3.4 Cache Coherence Protocol
    5.3.5 Statics
  5.4 Distributed Classloader
    5.4.1 Class Replication Stages
  5.5 Distributed Scheduling
    5.5.1 Thread Distribution
    5.5.2 Thread and VM Termination
  5.6 Thread Synchronization
  5.7 I/O Redirection
6 Benchmarks
  6.1 Test Environment
  6.2 Test Application Suite
  6.3 Performance Evaluation
7 Conclusions and Future Work
  7.1 Problems
  7.2 Future Work
    7.2.1 Object Home Migration
    7.2.2 Object Prefetching
    7.2.3 Thread Migration
    7.2.4 Fast Object Transfer Mechanism
    7.2.5 VM and Application Object Separation
    7.2.6 Garbage Collector Component
    7.2.7 I/O Redirection for Sockets
    7.2.8 Volatile Field Checks
    7.2.9 Trap Handler Detection of Faulting Accesses
    7.2.10 Software Checks in the Optimizing Compiler
  7.3 Conclusions
A Appendix
  A.1 DJVM Usage
  A.2 DJVM Classes
    A.2.1 CommManager
    A.2.2 Message
    A.2.3 SharedObjectManager
    A.2.4 GUIDMapper
    A.2.5 LockManager
    A.2.6 DistributedClassLoader
    A.2.7 DistributedScheduler
    A.2.8 IORedirector
  A.3 Original Task Assignment
Bibliography

List of Tables

3.1 Distributed runtime systems overview
4.1 Protocol header
6.1 Access on node-local and shared objects
6.2 Overhead of thread allocation, classloading and I/O redirection

List of Figures

2.1 LRC protocol
4.1 DJVM overview
4.2 Lazy detection of shared objects example
4.3 Shared object states
4.4 Classloading states
5.1 Cluster boot process
5.2 GUID table example
5.3 Cache coherence protocol
5.4 Thread distribution
5.5 Acquire lock on a shared object
6.1 Thread synchronization time
7.1 Thread representation
A.1 Message class hierarchy
A.2 Original task assignment page 1
A.3 Original task assignment page 2

1 Introduction

Since its introduction in 1995, Java has become one of the most popular programming languages. Although it is considered a productive language, the performance of Java applications was insufficient for a long time because of the poor performance of the Java Virtual Machine (JVM) that runs them. Due to recent progress in compilation, garbage collection, distributed computing, etc., Java has become attractive even for high-performance applications such as multi-threaded server applications that require considerable computational power. One way to obtain this power is to run computation-intensive applications on multi-core computers, which are expensive. Motivated by the emergence of cluster computing as a cost-effective substitute for such dedicated parallel computers, the idea of running multi-threaded applications truly in parallel and transparently within a cluster has become an active research topic. Although current JVMs are limited to a single machine, they could be extended to multiple machines due to their virtual design and the fact that they are not bound to any special hardware.

The general idea of a distributed JVM, a JVM running on a cluster, is to distribute Java threads among the cluster nodes to gain a higher degree of execution parallelism. To provide a transparent shared memory abstraction for Java threads, we added a global object space. We move objects into this space if they can be accessed by multiple threads on different machines, which allows threads to be scheduled on any node based on load and locality information. In this thesis, we extended the Jikes Research Virtual Machine to run within a cluster. In contrast to other approaches such as JESSICA and Java/DSM [24, 30] that use a page-based distributed shared memory, we embedded our Shared Object Space, a virtual global object heap, directly into the JVM to make it cluster-aware, which gives us more opportunities for further optimizations. We defined a cache coherence protocol to deal with objects in the Shared Object Space and in a node's local object heap, and added support for distributed object synchronization. Additionally, we developed mechanisms for classloading, to achieve a consistent view on all nodes, and for scheduling, to distribute threads among the cluster. Finally, we redirected all I/O operations to a single node in the cluster that handles all I/O, for transparency towards the programmer and the Java application itself (imagine that several threads are distributed among several cluster nodes and each thread calls System.out.println(); transparency is achieved if all console output is printed on the screen of a single node).

Chapter 2 introduces the background information that is needed to understand the design decisions we made for our system, which is presented in Chapter 4, where we discuss the concept of shared objects and the cache coherence protocol. In Chapter 3 we present related work in the area of distributed JVMs: we discuss different approaches and techniques for building a distributed runtime environment that runs multi-threaded Java applications and compare them with our work. In Chapter 5 we give implementation details of our distributed JVM; we measured its performance and show benchmark results in Chapter 6. Finally, in Chapter 7, we conclude, discuss problems we encountered during the work, and present several directions for future work on our distributed JVM, such as different optimization techniques.

2 Background and Preliminary Work

To clarify the design decisions we made for our virtual global object heap embedded in our chosen JVM, we give a short overview of the JVM we use and an introduction to software distributed shared memory systems, on which our global object heap is based. Finally, we describe the Java Memory Model, which is needed to understand the cache coherence protocol used in our implementation.

2.1 Jikes Research Virtual Machine

The Jikes Research Virtual Machine (JikesRVM) [3] is an open-source implementation of a Java Virtual Machine. The JikesRVM evolved from the Jalapeño Virtual Machine [7, 8] developed at IBM Watson Laboratories. A special characteristic of the JikesRVM is that it is written almost entirely in Java itself (some parts, such as the bootloader that launches the JVM and some relay code for performing system calls, are written in C). The JikesRVM code base provides a stable and well-maintained framework for researchers and gives them the opportunity to experiment with a variety of design and implementation alternatives. Due to the many publications (see http://jikesrvm.org/publications) released over the last ten years and a broad, active community enhancing the code base, the JikesRVM has become a state-of-the-art Java Virtual Machine.

2.1.1 Baseline and Optimizing Compilers

Within the JikesRVM, Java bytecode is not interpreted; instead, each method is compiled from bytecode into native machine code by either the Baseline or the Optimizing compiler. The Baseline compiler can be considered a fast pattern-matching compiler, i.e. the bytecode instructions are matched against several patterns and the compiler emits the corresponding machine code. Since the Baseline compiler translates the bytecode in the stack-machine manner of the JVM specification [21], the resulting native code runs slowly; its quality is similar to that of an interpreter. A simple assignment such as x = y + z is translated into the pseudocode shown in Listing 2.1.

Listing 2.1: Pseudocode produced by the Baseline compiler for a simple assignment

load(y, r1)      // load y into register r1
push(r1)         // push r1 onto the stack
load(z, r2)
push(r2)
pop(r1)          // pop value from stack into r1
pop(r2)
add(r1, r2, r3)  // r3 = r1 + r2
push(r3)
pop(r3)
store(r3, x)     // store the value in r3 to variable x

To improve performance, the Optimizing compiler utilizes the registers more effectively and applies advanced optimization techniques such as dead-code removal to produce competitive code [7]. The Optimizing compiler implementation exceeds the Baseline compiler in size and complexity. The JikesRVM allows an adaptive configuration in which methods are initially compiled by the Baseline compiler, since this translation is faster. Frequently used or computationally intensive methods are identified by the adaptive optimization system of the JikesRVM and recompiled with the Optimizing compiler for better performance (cf. [3] for more details).

2.1.2 Threading Model

The JVM specification [21] defines the behaviour of Java threads, but it places no constraints on the underlying threading model, i.e. how that behaviour must be implemented on a specific operating system and architecture. There are two threading models commonly used in JVM implementations: the Green threads model and the native threads model. Green threads are scheduled by the VM and managed in user space. Since Green threads emulate a multi-threaded environment without using any native operating system capabilities, they are lightweight but also inefficient. From the underlying operating system's point of view only one thread exists, namely the VM itself. As a consequence, Green threads cannot be assigned to multiple processors, since they are not visible at OS level; they are bound to execute within a single JVM running on a single processor. Native threads, on the other hand, are created in kernel space, which allows them to benefit from a multi-core system. This can also improve performance because blocking I/O system calls do not stall the whole JVM; only the native thread waiting for the I/O result is blocked. The Green threads model was the only model used in Java 1.1 for the HotSpot JVM. Due to the limitations described above, subsequent Java versions dropped Green threads in favor of native threads (see http://en.wikipedia.org/wiki/green_threads).

The threading model used in the JikesRVM is an M:N threading model that, in effect, combines both models. Java threads — i.e. threads from the Java application (in this thesis, the term user thread sometimes refers to Java threads created by the application) and VM daemon threads such as mutator and garbage collector threads — are multiplexed onto one or more virtual processors, each of which is bound to a POSIX thread. The number of virtual processors can be set on the command line; by default only one virtual processor is created. The JikesRVM runtime system schedules an arbitrarily large number M of Java threads over a finite number N of virtual processors. As a result, at most N Java threads can execute concurrently. The M:N threading model benefits from the fact that the runtime system uses only N POSIX threads from the underlying operating system. As a consequence, the OS cannot be swamped by Java threads crowding out system and other application threads, as could happen in the native threads model, which corresponds to a 1:1 thread model. However, the main reason for using Green threads is that the JikesRVM utilizes them for garbage collection. Every few bytecode instructions, each Green thread checks its yieldpoint condition, which is inserted during method compilation. Only if the condition evaluates to true is the Green thread allowed to switch to another thread, which makes thread switching a pro-active process. This approach simplifies garbage collection because the collector thread knows the state and stack frame of each mutator, which are stored in a GC map, so the computation of the root set is easier. With the M:N threading model the JikesRVM also benefits from system-wide thread management, since it only needs to synchronize on at most N active threads instead of an unbounded number M of threads. For example, a stop-the-world garbage collector merely needs to flag the N currently running threads to switch into a collector thread rather than having to stop every mutator thread in the system. A drawback of the M:N threading model, however, is that many native I/O operations are blocking, so the calling native thread is blocked until the operation completes. As a result, the virtual processor is blocked and unable to schedule further threads. To avoid this problem, blocking I/O operations are intercepted by the JikesRVM runtime system and replaced by non-blocking operations. The calling thread is suspended and placed into an I/O queue. The scheduler periodically polls for pending I/O operations; after they complete, the calling thread is removed from the I/O queue and queued for execution [3].

2.1.3 Object Model

An object model defines how an object is represented in memory (an object is logically composed of two parts: an object header and the object's instance fields or array elements). The object model used in the JikesRVM fulfills requirements such as fast instance field and array accesses, fast method dispatching, a small object header size, etc. The object header in the default object model consists of two words for objects and three words for arrays, which store the following components:

TIB Pointer: The Type Information Block (TIB) pointer is stored in the first word and contains information about all objects of a certain type. A TIB consists of the virtual method table, a pointer to an object representing the type, and pointers to a few data structures for efficient interface invocation and dynamic type checking.

Hash Code: Every Java object has an identity hash code. By default the hash code is the object's memory address. Because some garbage collectors copy objects during garbage collection, the memory address of the object can change. Since the hash code must remain the same, space in the object header (part of the second word) may be used to store the original hash code value.

Lock: Each Java object has an associated lock state that is either a pointer to a lock object or a direct representation of a lock.

Garbage Collection Information: Garbage collectors may store information in an object, such as mark bits. The space allocated for this kind of information is located in the second word of the header.

Array Length (for array objects): The number of elements of an array is stored in the third word.

Miscellaneous Fields: For additional information, the object header can be expanded by several words, in which, for example, profiling information can be stored. Miscellaneous fields are not available in the default object model.

Object fields are laid out in the order they are declared in the Java source file, with some exceptions to improve alignment. double and long fields (and object references on a 64-bit architecture) are 8-byte aligned and are laid out first, so that alignment holes are avoided for these fields. For other fields, whose sizes are four bytes or smaller, holes might be created to improve alignment; e.g. for an int field followed by a byte field, a three-byte hole following the byte field is created to keep the int field 4-byte aligned.

2.2 Distributed Shared Memory

Software distributed shared memory (DSM) systems virtualize a global shared memory abstraction across physically connected machines (in the following, the term DSM refers to software DSM systems only). DSM systems can be classified into two categories based on the granularity of the shared memory region. In page-based DSM systems, shared memory is organized as virtual memory pages of a fixed size. The object-based approach uses the shared memory region as an abstract space for storing shareable objects of variable size. Additionally, a coherence protocol combined with a memory consistency model maintains memory coherence.

2.2.1 Page-Based Distributed Shared Memory

Page-based DSM systems utilize the Memory Management Unit (MMU) to detect faulting accesses to an invalid or unavailable shared page. The MMU handles the faulting access by fetching a valid copy from the machine where the shared page is located. The benefit of using the MMU is that only faulting accesses are intercepted, whereas normal accesses do not have to be handled. A drawback of page-based DSM systems, however, is the false sharing problem, which arises when two processes access mutually exclusive parts of a shared page. In this case two virtual memory pages, which usually have a fixed size of 4 KB by default, are mapped against the same shared page, resulting in two different views of the same page.

2.2.2 Object-Based Distributed Shared Memory

The object-based DSM approach reduces the false sharing problem of the page-based approach, since the granularity of sharing is an object of arbitrary size. Contention happens only if two processors access the same object; false sharing, however, cannot occur. Since the MMU cannot trap accesses to objects of variable size, software checks must be inserted upon every access to an object.

2.3 Memory Consistency Model

A memory consistency model is a set of rules that specify a contract between the programmer and the memory system. As long as these rules are followed, the memory system is consistent and the results of memory operations are predictable. In particular, the model defines which values may be read from shared memory that other threads could have written before. Sequential Consistency is one of the most intuitive consistency models for the programmer: every processor in the system sees the writes to the same memory location in the same order, i.e. each processor follows program order and all write operations are visible to the other processors. Because of this, Sequential Consistency performs poorly, since compiler optimizations such as reordering memory accesses to different addresses cannot be performed. In a DSM system, this results in excessive communication traffic. Therefore, several models have been introduced to relax the memory ordering constraints of Sequential Consistency. The idea of relaxed memory consistency models such as Release Consistency (a form of relaxed memory consistency that allows the effects of shared memory accesses to be delayed until certain specially labeled accesses occur [20]) is to reduce the impact of remote memory access latency in a DSM system. In the following subsection the Lazy Release Consistency model (LRC) is introduced, which is the base model for our DSM implementation. We also describe the memory consistency model used in the Java programming language, to clarify the comparison of our implemented model with the Java Memory Model later.
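As a brief illustration of why the consistency model matters, the following Java sketch (our own example; the class and field names are not taken from any system discussed here) shows two threads communicating through shared fields without any synchronization. Under a relaxed model such as the Java Memory Model discussed in Section 2.3.2, the reader may observe the write to flag without observing the write to data, or may never observe either, because the compiler, runtime, and hardware are free to reorder or cache the accesses; only proper synchronization restores the intuitive, sequentially consistent behaviour.

// Illustrative only: a data race whose outcome depends on the memory consistency model.
public class ConsistencyExample {
    static int data = 0;
    static boolean flag = false; // neither volatile nor protected by a lock

    public static void main(String[] args) {
        Thread writer = new Thread(new Runnable() {
            public void run() {
                data = 42;
                flag = true;        // may become visible before the write to data
            }
        });
        Thread reader = new Thread(new Runnable() {
            public void run() {
                while (!flag) { }   // may spin forever if the write to flag is never seen
                System.out.println(data); // may print 0 under a relaxed memory model
            }
        });
        writer.start();
        reader.start();
    }
}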

2.3.1 Lazy Release Consistency Model

Lazy Release Consistency is an algorithm that implements Release Consistency in such a way that the propagation of modifications to the DSM is postponed until the modifications become necessary, i.e. the propagations are sent lazily. The goal is to reduce the number of exchanged messages. In LRC, memory accesses are distinguished into sync and nsync accesses, where sync accesses are further divided into acquires and releases. An acquire operation is used to enter a critical section and a release operation is used when a critical section is about to be exited. A modification to the DSM becomes necessary when an acquire operation is executed by a processor P1, as shown in Figure 2.1.

[Figure 2.1: LRC protocol — at P1's acquire(x), the processor that last released the lock sends its diffs of x to P1.]

Before P1 enters the critical section, the last processor P2 that released the lock must send a notice of all writes that were cached in P2 before it exited the critical section to P1, so that the cached values in P1 become up-to-date. Another optimization that limits the amount of data exchanged is to send only a diff of the modifications to the DSM. A diff describes the modifications made to a memory page and is merged with the other cached copies [20].
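To make the notion of a diff concrete, the following Java sketch (our own illustration; the types and method names are not taken from any of the systems discussed) computes a diff between a cached copy of a page and its unmodified twin as a list of (offset, new value) pairs — the kind of compact update an LRC implementation can ship to other processors instead of the whole page.

import java.util.ArrayList;
import java.util.List;

// Illustrative only: diff computation for a page-based DSM (assumed types, not a real API).
public class PageDiff {

    /** A single modified byte: its offset within the page and its new value. */
    public static final class Change {
        final int offset;
        final byte newValue;
        Change(int offset, byte newValue) { this.offset = offset; this.newValue = newValue; }
    }

    /** Compare the working copy against the unmodified twin and collect the bytes that differ. */
    public static List<Change> diff(byte[] twin, byte[] copy) {
        List<Change> changes = new ArrayList<Change>();
        for (int i = 0; i < copy.length; i++) {
            if (copy[i] != twin[i]) {
                changes.add(new Change(i, copy[i]));
            }
        }
        return changes;
    }

    /** Apply a diff to another cached copy (or to the page's home copy). */
    public static void apply(byte[] page, List<Change> changes) {
        for (Change c : changes) {
            page[c.offset] = c.newValue;
        }
    }
}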

2.3.2 Java Memory Model

The Java language was designed with multi-threading support, which requires an underlying shared memory (the object heap) through which Java threads can interact. The memory consistency model for Java is called the Java Memory Model (JMM) and defines how the Java Virtual Machine's threads interact with a shared main memory (cf. [21] for the full specification). In the JMM, all threads share a main memory, which contains the master copy of every variable; in Java a variable is either an object field, an array element or a static field. Every thread has its own working memory that acts as a private cache in which working copies of all variables the thread uses are kept, i.e. when a variable in main memory is accessed, it is cached in the working memory of the thread. A lock, kept in main memory, is associated with each object. Java provides the synchronized keyword for synchronization among multiple threads: when a thread enters a synchronized block, it tries to acquire the lock of the object, and when it exits the synchronized block, it releases the lock again. A synchronized block not only guarantees exclusive access for the thread entering the block, it also maintains memory consistency of objects among threads that synchronize on the same lock. In particular, the JMM defines that a thread must flush its cache before releasing a lock, i.e. all working copies are transferred back to main memory, and that before a thread acquires a lock, it must invalidate all variables in its working memory. This guarantees that the latest values are read from main memory.
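The following small Java example (ours, for illustration only) shows these rules in action: because both threads synchronize on the same lock object, the JMM guarantees that the increment performed inside one synchronized block is flushed to main memory when the lock is released and re-read from main memory when the other thread acquires the lock, so no stale working copy is observed.

// Illustrative only: acquire/release on the same lock makes updates to counter visible.
public class SharedCounter {
    private final Object lock = new Object();
    private int counter = 0;

    public void increment() {
        synchronized (lock) {   // acquire: invalidate working copies, read from main memory
            counter++;
        }                       // release: flush the updated value back to main memory
    }

    public int get() {
        synchronized (lock) {   // acquire: guarantees the latest flushed value is read
            return counter;
        }
    }
}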

3 Related Work

In this chapter we give an overview of different approaches to building a distributed runtime environment that runs multi-threaded Java applications in parallel within a cluster. We briefly discuss the techniques used and highlight the advantages and disadvantages of each system. Finally, we compare the existing approaches with our work and point out our motivation to implement such a runtime system. As done in the paper about JavaSplit [16], existing distributed runtime systems for Java can be classified into three different categories:

Distributed Runtime Systems                     Advantages                          Disadvantages
Cluster-aware VMs                               Efficiency, direct memory access,   Portability
  (Java/DSM, cjvm, JESSICA, JESSICA2, djvm)     direct network access
Compiler-based                                  Compiler optimizations, local       Portability
  (Hyperion, Jackal)                            execution of machine code
Using standard JVMs                             Portability                         Reduced Single System Image
  (JavaParty, JavaSplit)

Table 3.1: Distributed runtime systems overview.

Each category has its benefits and drawbacks when it comes to portability and performance. While systems based on a standard JVM usually preserve portability, they might not provide a complete Single System Image (SSI), i.e. they deviate from the standard Java programming paradigm, for example by introducing new keywords. The systems in the other two categories usually have a complete SSI, but they have portability issues. In the following, we explain the concepts behind these different distributed runtime environments and describe some existing implementations.

3.1 Compiler-Based Distributed Runtime Systems

In compiler-based distributed runtime systems, the source or the bytecode of a Java application is compiled into native machine code. During translation, calls to DSM handlers are added and several optimizations are performed to improve performance. Since the Java application is compiled completely to machine code before its execution, execution is faster because a Just-In-Time (JIT) compiler is no longer required. Compiler-based systems also preserve the standard Java programming paradigm. However, such compilers need to be updated whenever the Java programming language changes. Jackal [28], for example, compiles Java source files directly into machine code; with the addition of generics in Java 5, the compiler had to be adjusted. Besides, the source code of the Java application might not be available for confidentiality reasons. As a consequence, such systems do not scale very well regarding portability, because they cannot use reflection and therefore have limited language support.

3.1.1 Hyperion

Hyperion [23] is a compiler-based DSM system that translates a Java application's bytecode into C source code, which is finally compiled into native machine code by a C compiler. While translating the bytecode, code for DSM support is added and several optimizations are performed to enhance performance. Since Hyperion allocates all objects in a global object heap, the local garbage collector needs to be extended; however, this has not been done in the paper [23]. A benefit of Hyperion compared to Jackal is that new Java language constructs do not necessarily result in changes to the bytecode. Therefore, Hyperion is less sensitive to the evolution of the Java language, but portability is still an issue.

3.2 Systems Using Standard JVMs

Since these distributed runtime systems utilize a standard JVM, they usually benefit from its portability. Such systems can use a set of heterogeneous cluster nodes. As a result, each node is able to perform local optimizations, e.g. using the JIT compiler, and the local garbage collector of the standard JVM can still be used. These systems, however, have two main drawbacks. Access to a node's resources is relatively slow, since the access must usually go through the underlying JVM. Additionally, most of these systems deviate from the standard multi-threaded Java programming paradigm. Systems like JavaParty [31] try to stay close to pure Java by introducing only a small set of new language constructs, such as the remote keyword to indicate which objects should be distributed among the cluster nodes. The source code is transformed and extended with RMI code. Therefore, the SSI is further reduced, since the programmer must differentiate between local and remote method invocations due to the different parameter passing conventions. JavaSplit [15, 16] is also a distributed runtime system that uses only a standard JVM. The authors claim that JavaSplit has all the benefits of systems using unmodified JVMs and, in addition, preserves an SSI.

3.2.1 JavaSplit

The techniques used in JavaSplit are based on bytecode instrumentation, performed by the Bytecode Engineering Library (BCEL, http://jakarta.apache.org/bcel/). JavaSplit takes the Java application as input, rewrites the bytecode and combines the instrumented code with its own runtime logic, which is also implemented in Java. The result is a distributed application that is ready to be executed on multiple standard JVMs. During bytecode rewriting, events relevant to a distributed runtime, such as the start of new threads, are intercepted. Such calls are replaced with calls to a DSM handler (similar to Hyperion described above) that, for example, ships the thread to a node chosen by a load balancing function. JavaSplit does not require any special language construct, thus preserving a complete SSI. To make use of the standard JVM's garbage collector, objects are classified into shared and local objects: the former are handled by JavaSplit's object-based DSM garbage collector, while the latter are reclaimed by the local garbage collector [15]. In addition, JavaSplit should be able to run on any standard JVM. However, in [16] the authors mention that the bytecode rewriter does not intercept I/O operations of the user application; instead, the Java bootstrap classes that perform low-level I/O are modified to achieve the desired functionality. In other words, the standard JVM needs to be modified, and therefore portability issues arise again.

3.3 Cluster-Aware JVMs

Cluster-aware JVMs are usually the result of modifying an existing standard JVM by adding distribution. These systems consist of a set of nodes that have the same homogeneous JVM installed and that execute their part of the distributed execution. In comparison to systems using standard JVMs, cluster-aware JVMs benefit from efficiency, since direct access to resources such as memory and network is possible rather than going through the JVM. Furthermore, cluster-aware JVMs usually preserve an SSI. On the other hand, due to the required homogeneity, cluster-aware JVM systems lack true cross-platform portability. In the following we introduce several known cluster-aware JVM systems.

3.3.1 Java/DSM

Java/DSM [30] was one of the first JVMs to contain a DSM. A standard JVM is implemented on top of a software DSM system called TreadMarks (http://www.cs.rice.edu/~willy/treadmarks/overview.html). The DSM handling is done by TreadMarks, since all Java objects are allocated through this DSM system. However, since a thread's location is not transparent to the programmer, the SSI provided by Java/DSM is incomplete.

3.3.2 cjvm

Instead of using an underlying software DSM, the Cluster VM for Java, called cjvm [9, 10], makes use of a proxy design pattern: rather than requesting a valid copy of an object as in DSM systems, the access is executed as a remote request and the actual execution is done on the node where the object is actually located. No consistency issue is involved in this approach. cjvm runs in interpreted mode. To improve performance, several optimizations such as caching and object migration techniques have been implemented.

3.3.3 JESSICA2

JESSICA2 [35] is based on the Kaffe VM (http://www.kaffe.org/) and embeds an object-based DSM called the Global Object Space (GOS). JESSICA2 implements many features such as thread migration, an adaptive home migration protocol for its GOS, and a dedicated JIT compiler, since the standard JIT compiler cannot be used (in contrast to systems using standard JVMs). The dedicated compiler improves performance significantly compared to an interpreter. Objects are separated into shared and local objects: shared objects are put into the GOS and handled by a distributed garbage collector, while local objects are reclaimed by the VM's local garbage collector. Like cjvm, JESSICA2 provides a true SSI.

3.3.4 djvm

The Jikes distributed Java Virtual Machine [36], called djvm, was developed at the Australian National University in Canberra and is the first known distributed JVM based on the JikesRVM. djvm follows a similar approach as cjvm by generating proxy stubs for remote access to objects, rather than implementing a GOS as in JESSICA2. The distribution code was added to JikesRVM version 2.2, which is out-of-date. In a later version of djvm (http://cs.anu.edu.au/djvm/), the developers switched to a bytecode rewriting approach using the BCEL, as done in JavaSplit.

3.4 Conclusion

In this chapter, we have given an overview of existing distributed Java runtime systems and highlighted their benefits and drawbacks. As discussed, systems using standard JVMs usually have the advantage of portability, but they usually have worse performance or introduce unconventional language constructs. For performance reasons, we pursue a cluster-aware JVM approach at the cost of portability. We decided to embed our work into the JikesRVM for several reasons:

- The JikesRVM is open-source and provides a stable infrastructure that allows experimenting with different implementation alternatives. The VM is consistently maintained and the documentation is mostly up-to-date.
- Its code base is relatively small compared to other open-source JVMs such as the OpenJDK. Additionally, the code is well-structured, so it is possible to understand the internal workings in a reasonable time.
- Unlike other open-source JVMs that only provide an interpreter, the JikesRVM contains an optimizing JIT compiler that produces high-quality code that can be used in a production environment.
- Since the JikesRVM only runs on Linux-based operating systems, portability is not a big issue for our work. (For our work, we used JikesRVM version 2.9.2, which was the latest stable build at the time of writing; recently, version 2.9.3 and version 3.0 have been released.)

Motivated by the concept of object-based DSMs in JavaSplit and JESSICA2, and by the fact that no such work has been done for the JikesRVM, we have embedded a virtual global object heap called the Shared Object Space into the JikesRVM and added mechanisms to achieve a complete SSI. Although running a VM on top of a page-based DSM such as Java/DSM would reduce the complexity of constructing a global object heap, since the cache coherence issues are managed by the DSM system, there are some drawbacks to be mentioned:

- Since the sharing granularity of Java and of the page-based DSM are not compatible (objects of variable size vs. virtual memory pages of a fixed size, cf. Section 2.2), the false sharing problem arises. Additionally, if a Java object spans virtual memory pages, a fault-in results in requesting two pages, which is quite heavyweight.
- Because the page-based DSM resides in a layer below the VM, it is not able to use the runtime information of the VM. In addition, detecting the access patterns of an application becomes difficult, since several Java objects could reside in a single memory page. In contrast, our Shared Object Space can fully utilize the abundant runtime information of the VM, which allows further optimizations.
- In page-based DSM systems, each node manages a portion of the shared memory where the objects are allocated. As a consequence, every object is considered shared and requires special treatment when it comes to garbage collection. In our Shared Object Space we distinguish objects shared among different nodes from local objects, since local objects require no special treatment.

Due to the above reasons we decided to develop an object-based DSM.

4 Design

In this chapter we discuss the design decisions we made when adding a distributed runtime system to the JikesRVM. We show how we extended the JVM to be cluster-aware with a master-slave architecture. We implemented a global object heap, the Shared Object Space, by defining shared objects, and we show how we solved memory consistency issues across node boundaries. Mechanisms for distributed classloading, thread synchronization and garbage collection, and for distributing the workload across multiple machines are also presented. To achieve transparency for Java applications, we also show how I/O is redirected in our distributed JVM.

4.1 Overview

The design we chose for our distributed JVM (DJVM) is based on a single system image (SSI), i.e. the underlying distribution should be transparent to the Java application running in the DJVM. Rather than using page-based DSM software as done in Java/DSM [30] and JESSICA [24], we decided to implement our own object-based DSM inside the JikesRVM to make the VM aware of the shared memory, which allows us to do further optimizations as described in Section 7.2.

[Figure 4.1: DJVM overview.]

For our cluster we use a master-slave architecture where one node in the cluster embodies the master node and the other nodes become so-called slave or worker nodes (in the following, these terms are used interchangeably). The master node is assigned some special functions such as setting up the cluster and controlling all global data (cf. Section 5.3.5) and classes loaded by the Java application. Furthermore, threads of the Java application are scheduled and distributed via the master node, which is therefore also responsible for shutting down the cluster when all threads of the application have terminated. Since threads can be distributed among different nodes in the cluster, they must still be able to synchronize on objects located on another node. We therefore developed a Shared Object Space that combines the object heaps of all cluster nodes into a virtual global object heap.

4.2 Home-Based Lazy Release Consistency Model

In Section 2.3.1 we have given a description of the LRC model. In [32] it has been shown that LRC does not scale very well because of the large amount of memory consumed by protocol overhead, so that garbage collectors have difficulties collecting that data. To achieve better performance, LRC has been extended to Home-based Lazy Release Consistency (HLRC) [32]. Each shared memory page is assigned a home base to which all propagations are sent and from which all copies are fetched. The main idea of HLRC is to detect possible updates to a shared memory page, compute the diffs and send them back to their home bases, where those diffs are applied. The diffs are computed only on non-home bases, by comparing the cached shared memory page with a so-called twin page that contains the original data from the initial page request. The advantages of HLRC compared to LRC are that accesses to pages on their home base never cause page faults, and that non-home bases can update their copies of shared pages in a single round-trip message. Thus, the protocol data and messages are smaller, because HLRC does not need to determine which processor most recently released a lock; this makes the timestamp technique used in LRC unnecessary.

4.2.1 Adaption to Our DSM

Our DSM is based on HLRC. Since our DSM is object-based rather than page-based, every object is assigned to a home node, which is the node from which the object's reference is initially exposed across node boundaries. When an object (a shared object, to be precise; see Section 4.4) is requested from another node, a copy of the requested object is created on the non-home node. Instead of creating the twin as a separate object as proposed in [32] and [20], we place the twin's data at the end of the requested copy, which saves us two words per shared object since we can omit the header for the twin. We adapted our cache coherence protocol to become similar to the JMM (see Section 2.3.2). Upon the release of a lock, diffs are computed for all written non-home objects by comparing each field of the copy with its respective twin. The updates are then sent in a message to the corresponding home node, which finally applies them to the home-node object. This procedure corresponds exactly to the cache flush in the JMM. When a lock is acquired, the JMM states that all copies in the working memory of a thread must be invalidated [21]. In our DSM, this means that every non-home node object must be invalidated, such that further reads or writes to an invalid object cause the runtime system to fetch the object's latest value from its home node.

4.3 Communication

Since our Shared Object Space is embedded into a cluster-based JVM, we have to make the JikesRVM cluster-aware by implementing a communication subsystem that is responsible for sending and receiving messages between nodes in the cluster. Our CommManager is an interface that hides the actual underlying hardware involved in communication. Currently, we provide two different communication channels:

- Raw sockets: provide fast communication by encapsulating messages directly in Ethernet frames.
- TCP/IP sockets: provide reliable communication between the cluster nodes (TCP is reliable because lost messages are retransmitted and messages are received in order).

Under the assumption that all our cluster nodes are located within the same subnet of a local area network, we provide a communication substrate based on raw sockets. Because messages sent over these sockets are encoded directly in Ethernet frames, we avoid the overhead of the messages going through the TCP/IP stack of the operating system. As shown in a comparison in [6], raw sockets can even outperform UDP sockets, which do not provide reliability compared to TCP. Broadcasting messages over raw sockets is also a benefit, since only one message (using the broadcast MAC address FF:FF:FF:FF:FF:FF) must be sent to address all nodes in the cluster. However, due to the lack of reliability of raw sockets, we also provide communication based on TCP/IP.

Our cluster supports a static set of nodes. At startup, after booting the CommManager, the master node takes the role of a coordinator and waits until all worker nodes have established a connection to it (the total number of worker nodes is given as a parameter when starting the cluster-based JikesRVM). When all worker nodes have set up a connection, the master node sends out the unique node IDs that are assigned to each node and the nodes' MAC or IP addresses. When using TCP/IP sockets, the worker nodes additionally establish communication channels between each other, so that any node can send messages to and receive messages from any other node.
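The CommManager thus plays the role of a small messaging facade in front of the two channel implementations. A hedged sketch of what such an interface might look like is shown below; the interface name, method names, and signatures are our own illustration and do not reproduce the actual DJVM class described in Appendix A.2.1.

// Illustrative sketch of a communication facade that hides the underlying channel
// (raw Ethernet sockets or TCP/IP). Names and signatures are assumptions, not the DJVM API.
public interface ClusterCommunication {
    /** Establish the channels to all other nodes and obtain this node's ID from the master. */
    void boot(int expectedWorkerNodes);

    /** Send a message to one node, identified by its cluster-wide node ID. */
    void send(int targetNodeId, byte[] encodedMessage);

    /** Broadcast a message to every node in the cluster. */
    void broadcast(byte[] encodedMessage);

    /** The unique ID that the master assigned to this node during startup. */
    int localNodeId();
}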

20 4. Design overhead of the communication traffic. Instead, we designed an own protocol consisting of a 15 byte header that is sufficient to reconstruct the message on the receiving node. Message Message Node-ID Packet Total Message- Message Type Subtype Number Packets ID Length 1 byte 1 byte 1 byte 2 bytes 2 bytes 4 bytes 4 bytes Table 4.1: Protocol header. The Message Type field defines the type of the message such as classloading, I/O, object synchronization, etc. The Message Subtype field specializes the exact type of the message, e.g. the ObjectLockAcquire subtype that is sent when a remote lock for an object should be acquired. The Node-ID contains the unique ID of the sending cluster node. Since a message cannot exceed the length of 1500 bytes when using raw sockets, every message larger than this limit is splitted into several packets. The Packet Number defines the actual packet of the possibly splitted message whereas the Total Packets field states the total amount of packets the splitted message consists of. Finally we have two 4-byte fields for the Message-ID that is used for messages requiring an acknowledgment and for the Message-Length defining the length of the message excluding the 15-byte header per packet. When sending a message, it is encoded into a buffer that is obtained from a buffer pool, set up by the CommManager to avoid allocation upon each use. If the buffer pool becomes empty, new buffer objects will be allocated to expand the pool. A filled buffer contains all necessary data to reconstruct the message object on the receiving node. By serializing the messages ourselves, we reduce communication traffic for each transmitted message compared to the standard object serialization API in Java. For every socket, a MessageDispatcher thread waits in a loop to receive messages. This thread dispatches all incoming packets by reading the 15 bytes of the protocol header first and assembling them together to construct a new message object which can then be processed by calling the process() method (more details are given in Section 5.1). 4.4 Shared Objects In our DJVM, the object heaps of every node contribute for a global object heap that we call the Shared Object Space. Before we start to explain the design of our Shared Object Space, we discuss some memory related issues by defining Shared Objects, what their benefits are and how they are detected to become part of the Shared Object Space. 4.4.1 Definition An object is said to be reachable if there is a reference in a Java thread s stack or if there exists a path in a connectivity graph 6. When an object is reached by only one thread, that object is considered 6 Two objects are considered as connected when one object contains a reference to the other object. A connectivity graph is formed by the transitive relationships of connected objects.

4.4 Shared Objects

In our DJVM, the object heaps of all nodes contribute to a global object heap that we call the Shared Object Space. Before we explain the design of our Shared Object Space, we discuss some memory-related issues by defining shared objects, what their benefits are and how they are detected to become part of the Shared Object Space.

4.4.1 Definition

An object is said to be reachable if there is a reference to it in a Java thread's stack or if there exists a path to it in a connectivity graph (two objects are connected when one object contains a reference to the other; the connectivity graph is formed by the transitive relationships of connected objects). When an object is reached by only one thread, that object is considered a thread-local object, and since only one thread operates on it, synchronization is not an issue, whereas objects reached by several threads need to be synchronized (cf. Section 2.3.2). In a distributed JVM, an object can also be reached by Java threads that reside on a different node in the cluster. We therefore distinguish (based on [16, 17]) between node-local objects, which are only reached by one or multiple threads within a single node, and shared objects, which are reached by at least two threads located on different nodes. This means that when an object is exposed to a thread residing on a different node than the object itself, the object becomes a shared object.

4.4.2 Globally Unique ID

In a JVM running on a single node, every object in the heap can be reached by loading its memory address. In a distributed JVM, however, locating an object by a memory address is not sufficient, since an object o1 at a specific address on node N1 is not necessarily located at the same address on another node N2. Shared objects are, by definition, required to be reachable from any node. For that purpose, every shared object is assigned a Globally Unique Identifier (GUID). Every node keeps a GUID table that maps an object's local address to its GUID, which allows pointer translation across node boundaries. A GUID consists of 64 bits, containing a Node-ID and an Object-ID (OID) part:

    unused (1 bit) | Node-ID (2 bits) | Object-ID (61 bits)

We leave the most significant bit unused; it can be used freely, for example by the garbage collector. Since our DJVM is a prototype of a distributed runtime system designed for small clusters, we use only 2 bits for addressing at most 4 different nodes, leaving 61 bits for the OID. Our design is flexible, however, so this split can be changed by modifying a constant. As discussed in Section 4.2.1, every shared object is assigned to a home node. Therefore, the information encapsulated in a GUID suffices to locate an object on a particular node.
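The GUID layout above can be expressed with a few bit operations. The sketch below assumes the split of 1 unused bit, 2 Node-ID bits and 61 OID bits described in the text; the class and method names are illustrative.

    // Illustrative packing/unpacking of a GUID (1 unused bit, 2 bits Node-ID, 61 bits OID).
    final class Guid {
        static final int NODE_ID_BITS = 2;                  // at most 4 nodes in the prototype
        static final int OID_BITS = 61;
        static final long OID_MASK = (1L << OID_BITS) - 1;

        static long pack(int nodeId, long oid) {
            // the most significant bit stays 0 (reserved, e.g. for the garbage collector)
            return ((long) nodeId << OID_BITS) | (oid & OID_MASK);
        }

        static int nodeId(long guid) {
            // the home node of the object is encoded directly in its GUID
            return (int) ((guid >>> OID_BITS) & ((1 << NODE_ID_BITS) - 1));
        }

        static long objectId(long guid) {
            return guid & OID_MASK;
        }
    }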

4.4.3 Detection

Following the definition in Section 4.4.1, an object becomes a shared object if two threads located on different nodes reach it. Because of Java's strong type system, every object on the heap and every thread-local value on a thread's stack is associated with a particular type that is either a reference type or a primitive type. Since the JikesRVM runtime system keeps track of all types, it is easy to identify which variables are reference types. By creating an object connectivity graph we can detect at runtime whether a particular object can be reached from a thread's stack. When a thread is scheduled on a different node in the cluster (cf. Section 4.6), all objects reachable from the thread's stack become shared objects, since they are now reachable from a different node, too.

The idea of marking all reachable objects as shared whenever an object becomes shared leads to several performance issues, since objects with a large connectivity graph have to be traversed. We also encountered problems when objects that reference VM-internal objects were marked as shared (e.g. a java.lang.Thread object references a VMThread, which is a VM-internal object). Because we cannot distinguish VM objects from application objects in the JikesRVM (see Section 7.1 for more details), we use a lazy detection scheme for shared objects similar to the one described in [18], along with explicit exception cases so that some objects never become shared. Sharing objects lazily also makes sense because the maintenance cost of a node-local object is lower and because it is undecidable whether an object will ever be accessed by a thread that can reach it.

When a thread is distributed from a sender node to a receiver node, we only mark the objects directly reachable from the thread's stack as shared objects and assign a GUID to them. When the thread is sent to the receiver node, the GUIDs belonging to the references of the thread object are transmitted along with it. On the receiving side we reconstruct the thread object, and the received GUIDs are looked up in the GUID table, which maps a GUID to a local memory address. If a GUID is found, the local memory address of the object belonging to that GUID is written into the corresponding object reference field. If the GUID is not found, the object has never been seen before and the object reference field is replaced by an invalid reference. Our software DSM detects accesses to such dangling pointers, which leads to a request for the corresponding object from its home node, i.e. a copy of the object is faulted-in from the home node.

Consider the example shown in Figure 4.2. Thread T1 has a reference to the root object a. Thread T2 is created and a reference to object c is passed to it (Figure 4.2(a)). Since object c is reachable by thread T2, c is detected as a shared object when T2 is distributed and started on node 2 (Figure 4.2(b)). Note that objects a and b remain local objects, since they are only reached by the local thread T1. Upon the first access to c by T2, object c is requested from node 1. Because object c has references to d and e, these become shared objects as well (Figure 4.2(c)). Node 2 receives the GUIDs of d and e and a copy of c, for which dangling pointers are installed representing remote references to d and e. When T2 accesses object d in a later step, d is faulted-in from node 1 (Figure 4.2(d)).

[Figure 4.2: Lazy detection of shared objects. Panels: (a) initial object connectivity graph; (b) distribution of T2 to node 2, marking c as shared; (c) fault-in of object c, making d and e shared; (d) fault-in of object d.]

4.4.4 States

As described in Section 4.2.1, shared objects can be in several different states, shown in Figure 4.3. Non-home shared objects can become invalid after an acquire operation, requiring the object to be faulted-in on further reads or writes. Since only modified non-home objects must be written back to their corresponding home node upon releasing a lock, we set a bit in the shared object's header upon a write operation, so that only the diffs of those fields of a non-home shared object that were actually changed are computed. After the diffs have been propagated to the home nodes, the write bit is cleared.

[Figure 4.3: Shared object states.]

4.4.5 Conclusion

By combining the object heaps of several JVMs, we form a single virtual heap that consists of our so-called Shared Object Space and a local heap. Shared objects are moved into the Shared Object Space when they are reachable from at least two threads on two different nodes. Node-local objects remain in the local heap. This separation of the heap differs from other approaches described in Chapter 3, where usually all objects are allocated in one global object heap only. The lazy detection scheme helps us to gain performance when dealing with memory consistency issues. As specified in the JMM, a Java thread's working memory is invalidated upon a lock, and the cached copies must be written back to main memory upon an unlock. In our context this means that these operations are triggered when a node-local object is acquired or released. Furthermore, in our DJVM we have to consider distributed memory consistency, since changes to cached copies must be written back to their home nodes: when a shared object is locked or unlocked, these operations are much more expensive than their local counterparts. By sharing only objects that are currently accessed by a remote thread, we reduce the communication traffic of the distributed cache coherence protocol (see Section 5.3.4).

4.5 Distributed Classloading

The JikesRVM runtime system keeps track of all loaded types. If a type is loaded by the classloader, so-called meta-objects such as VM_Class, VM_Array, VM_Field, VM_Method, etc., describing classes, arrays, fields and methods respectively, are created. Therefore, if classes are loaded into the JVM, the type information maintained by each cluster node is altered. For our DJVM we make

use of a centralized classloader (similar to [36]) that helps to achieve a consistent view of all loaded types on all nodes in the cluster by replicating these meta-objects to the worker nodes. Loading all classes on a centralized node simplifies the coordination, but it also has the disadvantage of creating a bottleneck. Since classloading becomes less frequent over the runtime of a long-running application, however, we believe that this should not affect performance. Once classes have been loaded, they can be compiled and instantiated locally.

Since the JikesRVM is written in Java, a set of classes is needed for bootstrapping. These classes are written into the JikesRVM bootimage, are considered initialized, and do not need to be loaded anymore. When the JikesRVM is booted, some additional types are loaded by a special bootstrap classloader to set up the JikesRVM runtime system. After booting, the types of the Java application are loaded by the application classloader. In our DJVM, the cluster is set up during boot time of the VM. Therefore, to achieve a consistent view on all cluster nodes, we must further divide the boot process into two phases:

1. Before the cluster is set up, all types are loaded that are needed before a node can join the cluster.

2. After the cluster is set up, some additional types needed for the VM runtime system are loaded.

To achieve a consistent view of the type information on all cluster nodes before the cluster is set up, we add all these classes to the bootimage prior to cluster booting. Since only one thread is running during the boot process, we can guarantee consistent class definitions on all nodes at this point. After the cluster is set up, consistency is maintained by the centralized classloading mechanism described above.

[Figure 4.4: Classloading states. A Java .class file is read into a VM_Class object, which passes through the states Loaded (load), Resolved (resolve), Instantiated (instantiate), Initializing ((pre) initialize) and Initialized ((post) initialize).]

When a class that is not located in the bootimage is loaded, its Java .class file is read from the filesystem and a corresponding VM_Class object is created in the VM and put into the loaded state, as shown in Figure 4.4. Resolution is the process where all (static) fields and methods of the class are resolved and linked against the VM_Class object (state resolved). After compiling all static and virtual methods, the object is considered to be instantiated.

Initialization finally invokes the static class initializer, which is done only once per class. During initialization the VM_Class object is put into the initializing state; after the process has completed, the state changes to initialized.

Because the JikesRVM stores all static variables in a special table, all classloading phases except instantiation must be performed by the master node. The master node executes the classloading phase and replicates the resulting type information to the cluster nodes. The instantiation process can be done locally on any node, since no entries are inserted into the static table. An exception exists for the initialization phase of classes that are considered local to each node, such as runtime support classes (see Section 5.3.5). If such a class is initialized, the initialization is also done locally on each node, since the class is intended for internal management purposes (it is not application relevant but used internally by the runtime system) and is therefore not considered global.

4.6 Scheduling

To gain performance in a cluster-aware JVM, threads of the Java application are distributed automatically to the nodes in the cluster. In our DJVM we use a load balancing mechanism based on the Linux Load Average [5], a time-dependent average of the CPU utilization; the kernel reports the averages over the last 1, 5 and 15 minutes. Every worker node reports the Load Average of the last minute periodically to the master node, i.e. after a certain time period a message containing the Load Average is sent to keep the CPU utilization information up-to-date. Load balancing is achieved by distributing Java application threads to the node in the cluster with the lowest load. Upon the start of an application thread, the master node is contacted to get the ID of the cluster node with the lowest Load Average. The thread is then copied to that cluster node, where it is finally started. The node to which the thread is distributed should become the new home node of the thread object due to access locality: if the node the thread is distributed to were a non-home node of the thread object, the thread object would become invalid upon every lock operation due to memory consistency. As a consequence the thread object would have to be faulted-in from its home node, which is quite inefficient. Since our DJVM does not currently support object home migration (see Section 7.2.1), we introduced a mechanism to simulate object home migration for threads to avoid these access locality problems (described in Section 5.5.1). We follow an initial placement approach because it is lightweight and easier to implement compared with thread migration (see Section 7.2.3).

4.6.1 Synchronization

Even after Java threads have been distributed among the nodes of the cluster (cf. Section 4.6), they should still be able to communicate with each other through the synchronization

4. Design 27 operations lock, unlock, wait and notify across node boundaries. In our design that is based on the HLRC protocol described in [32], each shared object is assigned a respective home node on which the synchronization operations are executed, i.e. if synchronization happens on a non-home shared object, a synchronization request is sent to the corresponding home node where the synchronization is finally processed. If the shared object is already located on its home node, the lock can be acquired locally without sending a request. Due to our classloading concept where loaded classes are always replicated from the master node to its worker nodes, all objects of type Class are automatically considered shared, with the master node as their home node. Whenever a synchronization is executed on a Class object, the request is redirected to the master node because there is only one instance of a Class object. 4.7 I/O Redirection As mentioned earlier, our DJVM is based on an SSI design even for I/O. To achieve this, all I/O operations are redirected to the master node. As a consequence whenever the Java application opens a a file for example on a worker node, the file name will be forwarded to the master node where the file is actually opened. The file descriptor is returned to the worker node. If any read or write operations happen in a later stage, the worker node sends this file descriptor to the master node where the I/O operation is executed and the result is sent back to the requesting worker node. The redirections do also apply for the standard file descriptor meaning that the output of a System.out.println() in the Java application will always be displayed on the master node. With our approach, we avoid a global view of the filesystem such as NFS used in [17] so that each JVM instance on each cluster node sees the same file system hierarchy of the master node. 4.8 Garbage Collection In this section, we give an algorithm for a garbage collector in our Shared Object Space. We describe a distributed Mark and Sweep garbage collector. Since Mark and Sweep does not involve copying objects in the heap, updates of object addresses in the distributed GUID table (cf. Section 5.3.2) are not needed. To add distribution to Mark and Sweep, we need to introduce three synchronization points. 1. If a node in the cluster runs out of memory, garbage collection is executed and a message should be sent to all nodes to force their garbage collectors to run (first synchronization point). 2. The mark phase starts on every node locally. 3. If there are any marked objects that are also marked as shared, a message must be sent back to the shared objects home nodes where the objects will also be marked (if not al-

28 4. Design ready happened) because the shared objects are still reachable from a remote thread (second synchronization point) 4. After detecting every reachable object, the sweep phase can run locally on each cluster node. Garbage collected shared objects that have been inserted into the GUID table earlier should be removed from the table. 5. Since the sweep phase stops every thread in the system, the worker nodes report to the master node when they have finished the sweep phase. If all cluster nodes are done with sweeping, the master node sends a broadcast message to all worker nodes to inform them to continue execution (third synchronization point). Steps 2 and 4 are from the usual Mark and Sweep garbage collector whereas steps 1, 3 and 5 are needed because of the distributed nature of the JVM. It should be mentioned that the second and the third synchronization points add an overhead since all threads must be kept suspended until the corresponding messages arrive. Note that a local garbage collector will not suffice since shared objects on the home node cannot be reclaimed if there are remote references to it on another cluster node. Turning back a shared object on its home node to a node-local object should be done in step 4 when all non-home nodes have reported their remote references. If a shared object is marked as reachable during the mark phase but has no remote references pointing to it, the shared object can be removed safely from the Shared Object Space since the object is only reached by local threads (cf. definition in Section 4.4.1 and future work in Section 7.2.6).

5 Implementation

This chapter gives some implementation details for the design decisions made for the DJVM as described in Chapter 4.

5.1 Messaging Model

As explained in Section 4.3, the communication within the cluster is based on a message model. We have developed a set of messages that inherit from a common abstract superclass Message. By introducing a class tree of different message types, the messaging model becomes flexible and extensible, at the cost of some overhead due to additional method calls and conversions. The Message class provides some concrete implementations, such as getters for the message's type, subtype or ID as described in Section 4.3.1. Furthermore, each message contains code for sending, processing and converting a message into bytes and vice versa:

send: This concrete implementation gets a message buffer from the buffer pool and converts the message into a bytestream by calling the method toBytes. The message buffer containing the message's data is then sent over the socket to the receiving node.

toBytes: This method is abstract, and a concrete class extending Message must implement it. In this method the message's data is serialized into a bytestream that can finally be sent over the communication socket.

dispatch: This static method is intended for message deserialization. The Message object is constructed on the receiving side by decoding the bytestream from the sender. The resulting data are passed to a private constructor so that the message object can be created.

process: After the Message object has been constructed on the receiver node, the message is finally processed by the MessageDispatcher thread. This method is abstract and each subclass defines the actual work that must be done upon receiving this message.

When a message is received, the MessageDispatcher thread calls the static dispatch method of the class Message, where the header of the packet (cf. Table 4.1) is inspected to get the message type and its subtype. For example, the type could be MessageIO, an abstract class from which all concrete messages related to I/O extend, such as MessageIOChannelReadReq, which contains the code for an I/O read operation request and is therefore declared as its subtype.
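To illustrate how a new message type plugs into this class tree, the sketch below shows a stripped-down Message superclass with the abstract toBytes/process hooks and a hypothetical subclass carrying a single int. The type and subtype constants and all class names here are invented for illustration and do not correspond to the real message hierarchy.

    import java.nio.ByteBuffer;

    // Stripped-down sketch of the abstract superclass (illustrative, not the real class).
    abstract class Message {
        final byte type;
        final byte subtype;

        Message(byte type, byte subtype) {
            this.type = type;
            this.subtype = subtype;
        }

        abstract byte[] toBytes();   // serialize the message's payload
        abstract void process();     // work to perform on the receiving node
    }

    // Hypothetical example subclass: an echo message carrying one int value.
    class MessageEcho extends Message {
        static final byte TYPE_SYSTEM = 1;    // invented constants
        static final byte SUBTYPE_ECHO = 7;

        private final int value;

        MessageEcho(int value) {
            super(TYPE_SYSTEM, SUBTYPE_ECHO);
            this.value = value;
        }

        @Override
        byte[] toBytes() {
            return ByteBuffer.allocate(4).putInt(value).array();
        }

        @Override
        void process() {
            // executed by the MessageDispatcher thread on the receiving node
            System.out.println("echo received: " + value);
        }
    }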

The received bytestream is finally passed to the dispatch method of the concrete implementation, so that the Message object is created and can be processed.

Some messages, such as I/O messages, require the sending thread to block until the reply arrives. If a message needs an acknowledgement, the message ID combined with the destination's node ID is inserted, together with the Message object, into a message registry. The sending thread is forced to wait on the Message object until the response is received; only the sending thread is blocked, all other running threads in the system can continue their execution. When processing the acknowledging message, the MessageDispatcher thread removes the corresponding entry from the message registry and wakes up the waiting thread so that it can resume its execution.
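The request/acknowledgement pattern described above can be pictured with the following sketch, which keys pending requests by message ID and destination node and parks the sender on a waiter object until the dispatcher completes the entry. The class and method names are illustrative assumptions, not the actual implementation.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative sketch of a message registry for blocking request/acknowledgement pairs.
    final class MessageRegistry {
        private final Map<Long, Object> pending = new ConcurrentHashMap<>();

        private static long key(int messageId, int nodeId) {
            return ((long) nodeId << 32) | (messageId & 0xFFFFFFFFL);
        }

        // Called by the sending thread after transmitting a message that needs an acknowledgement.
        void awaitReply(int messageId, int destNodeId) throws InterruptedException {
            long k = key(messageId, destNodeId);
            Object waiter = new Object();
            pending.put(k, waiter);
            synchronized (waiter) {
                while (pending.containsKey(k)) {   // guards against spurious wakeups and early replies
                    waiter.wait();
                }
            }
        }

        // Called by the MessageDispatcher thread when the acknowledgement arrives.
        void complete(int messageId, int srcNodeId) {
            Object waiter = pending.remove(key(messageId, srcNodeId));
            if (waiter != null) {
                synchronized (waiter) {
                    waiter.notifyAll();
                }
            }
        }
    }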

5.2 Boot Process

In this section we briefly describe the modifications to the boot process of the JikesRVM; in particular, we explain how the cluster is set up. When the JikesRVM is launched, a single thread is started which is responsible for the bootstrapping. Since our cluster needs to be set up during boot time so that the centralized classloading is activated (cf. Section 4.5), the boot thread sets up the communication by instantiating the corresponding raw or TCP sockets and starting a MessageDispatcher thread (and a MessageConnectionListener thread if TCP is used). At this point the boot processes of the master node and the worker nodes diverge slightly, as Figure 5.1 shows. The master node executes the following steps:

1. Since the number of cluster nodes is static and given as a parameter at the start of the VM, the boot thread waits until all worker nodes have established a connection to the master node and have reported their node ID and MAC / IP address.

2. The boot thread wakes up when all worker nodes have successfully connected to the master node and finishes booting the VM. This involves setting up all runtime support data structures.

3. When the VM is booted, the master node sends a message to the worker nodes telling them to finish their booting process. The message sent to the worker nodes contains all node IDs and their corresponding MAC / IP addresses.

4. Before the boot thread terminates, a VM_MainThread is started to execute the main method of the Java application. In the DJVM, this main thread is blocked until the worker nodes have completed their bootstrapping.

5. When all worker nodes have acknowledged that they have finished booting, the main thread is woken up and executes the main method of the Java application.

[Figure 5.1: Cluster boot process. The worker nodes join the cluster and wait; the master node finishes booting, distributes the node IDs and addresses, and starts its main thread; the worker nodes finish booting, report back and are then ready to schedule threads from the Java application, while the master node's main thread executes the application's main method.]

The worker nodes execute the following operations:

1. After establishing a connection to the master node, the boot thread waits until the master node has finished its boot process.

2. When the message from the master node is received, the worker nodes know each other's node IDs and MAC / IP addresses.

3. If TCP sockets are used, the worker nodes establish connections to each other.

4. The boot thread wakes up to continue the bootstrapping.

5. After booting, the main thread is started on the worker nodes, but it does not execute the main method of the Java application. Instead, it waits for work, i.e. an application thread, so that the VM does not terminate.

6. A message is sent back to the master node to report that the worker node has successfully booted.

When all worker nodes have finished their boot process, they are ready to accept distributed threads from the master node. The main thread that is started on each worker node waits on a special lock and is woken up when the Java application terminates, so that the worker node VMs can terminate (more details are given in Section 5.5.2).

5.3 Shared Object Space

Our Shared Object Space is defined as a virtualized global object heap consisting of shared objects. To move an object into this space, or to remove it again, we need a mechanism to declare an object as shared. In the following, all implementation aspects related to the Shared Object Space are presented.

5.3.1 Shared Objects

In the JikesRVM object model (described in Section 2.1.3), every object has a two-word header: one word contains a pointer to the Type Information Block (TIB), the other is reserved for the status header. By default this status word is laid out as

    LLLLLLLLLLLLLLLLLLLLLL HH AAAAAAAA
    22 bits                 2   8

with 22 bits for storing locking information, 2 bits for the hash code state and 8 bits available for further use. Since some garbage collectors use some of the available bits (the Mark and Sweep garbage collector uses up to four of them), we utilize the upper three available bits for our shared objects:

    SHARED | INVALID | WRITE | AAAAA

The most significant of the available bits indicates whether an object is actually a shared object. The next bit defines whether the shared object is treated as invalid, i.e. before a further read or write operation on an invalid shared object can be executed, the object must be faulted-in from its home node. The third bit is used to mark a shared object as written (see Figure 4.3). This bit helps us to improve performance at synchronization points. Compared to JavaSplit [16], where shared objects inherit so-called DSM fields from a super class, we save considerably more space, since we do not have the overhead of storing the object's state (1 byte), locking status (4 bytes) and global ID (8 bytes) in fields.
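The three flags can be tested and updated with plain bit operations on the status word, as in the sketch below. The concrete bit positions are assumptions chosen to match the relative layout described above (the upper three of the eight available bits); the real JikesRVM header offsets may differ.

    // Illustrative bit masks for the three shared-object flags in the status word.
    // Assumes the eight "available" bits form the low byte, with SHARED/INVALID/WRITE
    // occupying its upper three bits; the exact positions are an assumption.
    final class SharedObjectBits {
        static final int SHARED  = 1 << 7;  // object belongs to the Shared Object Space
        static final int INVALID = 1 << 6;  // cached copy must be faulted-in before use
        static final int WRITE   = 1 << 5;  // cached copy was modified since the last flush

        static boolean isShared(int status)  { return (status & SHARED)  != 0; }
        static boolean isInvalid(int status) { return (status & INVALID) != 0; }
        static boolean isWritten(int status) { return (status & WRITE)   != 0; }

        static int markWritten(int status)  { return status | WRITE;   }  // set by the write check
        static int clearWritten(int status) { return status & ~WRITE;  }  // cleared after diff propagation
        static int markInvalid(int status)  { return status | INVALID; }  // set when a lock is acquired
    }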

5. Implementation 33 5.3.2 GUID Table Since shared objects are addressed by their assigned GUID as described in Section 4.4.2, every node in the cluster maintains a local GUID table where GUID objects are mapped against the shared object s local address. Particularly, we have introduced two maps for this purpose: GUIDAddressTypePairMap that maps GUIDs against so-called AddressTypePair objects which basically contain the object s local address and its type ID. ObjectAddressGUIDMap that maps the object s local address against the GUID which is used for fast lookup. When a shared object is faulted-in, the message that is sent from the home node to the requesting node contains the object s data, a list of GUIDs for all references in the object and a list of the types of the referenced objects. With this information and the available type information about the faulted-in object, a cached-copy of that object can be created and initialized in the following steps: 1. Allocate space for the object and for its twin object: The size of a cached copy, i.e. shared object on a non-home node, is twice the object s data size plus the object s header size. The information about the object s header and data size can be calculated with the locally available type information. 2. Initialize the object header by installing the TIB pointer and set the SHARED bit in the status header. 3. Copy the object s data into the allocated space and also fill the twin object. 4. With the type information the object s reference fields can be detected. After copying the object s data, these references are the local addresses of the objects on the home node, therefore we must replace them with the local addresses on the requesting node. 5. Iterate through the list of GUIDs contained in the message. Every GUID is looked up in the GUID table. (a) If the GUID is not found in the table, we make an entry in the GUIDAddressTypePairMap by mapping the GUID with the invalid address 0x00000001 and the object s type. The GUID s object address is stored into the cached-copy and the last bit is set to 1 which indicates that the object is not locally available (more details are given in Section 5.3.3) (b) If the GUID is found in the table, we check if the corresponding object address is valid. In this case the object is available locally (because it has been faulted-in before or because the object is on its home node) and we can store the object s address directly into the cached-copy s reference field. If the address is not valid, we again store the GUID s object address with the last bit set to 1 into the reference field.

34 5. Implementation 6. Make an entry in the ObjectAddressGUIDMap by mapping the faulted-in object s address against its GUID. Consider an example as shown in Figure 5.2. For the sake of simplicity, the type IDs have been omitted in this example but it should be mentioned that they are also part of the mechanism. Object o1 consisting of one int field and two reference fields is requested from its home node. Ref field1 points to an object at local address 0x1000 with GUID 1 and ref field2 points to an object at local address 0x2000 with GUID 2. The home node sends the object along with GUIDs to the requesting node where space for object o1 and its twin is allocated and filled with the object s data received in the message. GUID 1 is then looked up in the local GUID table where an entry is found that contains a valid address, i.e. the object is locally available. Therefore, the address 0x6000 is stored in both ref field1. GUID 2 is not found in the table and a mapping of the GUID with the invalid address 0x0001 is inserted into the GUID table. Since the GUID 2 is an object, ref field2 contains the address of the GUID object with the last bit set to 1 which helps our DSM to detect fault-ins. Figure 5.2: GUID table example. Note that the GUIDs sent in the message are primitive Java long types. GUID objects can only be created by using the static method findorcreate of the GUIDMapper class. Since every created GUID object is stored in a hash set in the GUIDMapper, we can make sure that we always have a reference to the GUID object such that the GUID object is not garbage collected. This step is needed because shared objects references pointing to GUID objects are modified as part of their initialization. 5.3.3 Faulting Scheme As described in Section 4.4.4, a shared object has several possible states: Write, Read and for non-home shared objects Invalid. Depending on the actual state several operations are triggered when a particular operation is executed on the object. For example if a read operation happens on an invalid object, the DSM updates the object s latest value from its home node and can then remove the invalid bit. A write to an object that is in the Read state, will change its state to Write, etc. Therefore, we require from our DSM that such faulting accesses are detected and handled correctly. For our faulting scheme we added software checks to the Baseline compiler 3 of the JikesRVM because our DSM is object-based and does not have the support of a MMU 3 Because of the complexity of the Optimizing compiler (cf. Section 2.1.1) we decided to add the software checks to the Baseline compiler only. Extending the Optimizing compiler is a topic for further work.

for detecting traps. Since the Baseline compiler takes the bytecode of the Java application and compiles it into machine code, the software checks are added to those bytecode instructions that access the object heap. In particular, we adjusted the compiler whenever machine code for one of the following instructions is emitted:

Load an object field (bytecode instruction GETFIELD)

Store a value into an object field (PUTFIELD)

Load a static field of a class (GETSTATIC)

Store a value into a static field (PUTSTATIC)

Load a value from an array (xALOAD, where x stands for the type of the array, e.g. A for a reference type, I for an integer type, etc.)

Store a value into an array (xASTORE)

For the instructions GETFIELD/PUTFIELD and xALOAD/xASTORE the added checks are quite similar, whereas GETSTATIC/PUTSTATIC are treated a bit differently (see Section 5.3.5). Since the states described above are only relevant for shared objects, we added the software checks directly in assembly language in the corresponding emit methods of the compiler.

Listing 5.1: Added software checks for loading a field of a primitive type

    if (obj == SHARED)
        if (obj == INVALID)
            updateInvalidObject(obj)
    load(obj.value)

Listing 5.1 shows the pseudocode for loading an object field of a primitive type. Since the object's state information is included in its header, the checks are done by emitting assembly instructions. If the object does not need to be updated (i.e. the shared object is not invalid), we can jump directly to the load instruction, meaning that the overhead per load is one load instruction for the header, two compare instructions for the shared and invalid state checks, and one potential jump instruction. With this implementation we postpone the expensive call to the function updateInvalidObject() (implemented in the class SharedObjectManager), which fetches the latest object values from the home node.

Since our DSM implementation uses dangling pointers, i.e. for references into the GUID table the last bit is set to 1, we need to add one more check whenever a reference field is loaded, as shown in Listing 5.2. As described in Section 5.3.2, if the reference's least significant bit is set to 1, the requestObject() method is executed, passing the 4-byte-aligned reference. The parameter is then actually a pointer to the GUID object. In the requestObject() method, we first check whether the GUID is mapped to a valid object address, because the object could have been faulted-in in the meantime.

Listing 5.2: Added software checks for loading a reference field

    if (obj == SHARED)
        if (obj == INVALID)
            updateInvalidObject(obj)
    if (addrOf(obj.ref) == invalid)
        requestObject(obj.ref)
    load(obj.ref)

If this is the case, the address is stored into the object's reference field. Otherwise the object is faulted-in from its home node, which can be determined from the GUID passed as the parameter.

In [18] a different approach is pursued to avoid dealing with dangling pointers. Instead of storing invalid addresses into reference fields when an object is faulted-in, so-called dummy objects are allocated for each reference. Dummy objects have the same size as the real object but are not initialized and therefore contain no data. In the reference fields of the faulted-in object, the addresses of the dummy objects are stored, so that no invalid pointers exist. In our approach we need an additional compare instruction when loading reference fields, but we save considerably more space, since we are not required to allocate space for these dummy objects. Additionally, such dummy objects might never be accessed, and as long as an object referencing them is still reachable during garbage collection, they will not be removed from the heap.

An alternative implementation of the faulting scheme is to make use of the JikesRVM trap handler. Java threads use a predefined set of system calls to communicate with the underlying operating system. If a trap occurs, a special trap handler forwards the signal from the OS back to the calling Java thread as an exception. The faulting access could therefore be handled in the trap handler. Since handling traps is much more expensive because of the switches between user and kernel space, it makes sense to do the checks in software if many shared objects have to be faulted-in. Because this approach is more complex than adding software checks, it is a topic for future work (see Section 7.2.9).

In [23] an object table is used, and object references are indices into this table. Each entry consists of two fields, a local object pointer and a remote object pointer, which is similar to our idea of letting not-yet-faulted-in object references point into the GUID table. However, since local object references also point into this object table, an additional load instruction is executed for every object access. In our DSM only one compare instruction is needed to detect whether an object is shared or local; a local object can be loaded directly from memory without first loading the address from an object table entry.

5.3.4 Cache Coherence Protocol

In HLRC, all non-home node objects must be written back to their home nodes if there was a previous write to the object. In our implementation, each node in the cluster keeps a list of non-home GUIDs, i.e. GUIDs belonging to those objects whose home is on another node. Upon an unlock of a shared object, a cache flush is performed as described in Section 4.2.1, which triggers the following operations:

1. Get all non-home GUIDs.

2. Get the memory address of the locally cached non-home shared object by looking it up in the GUID table.

3. Check whether the shared object has the WRITE bit set. If so, compute the diffs between the object and its twin object by comparing the object's fields word by word.

4. Clear the non-home object's WRITE bit, i.e. the object goes back into the READ state.

5. Pack the diffs into a message and send it to the corresponding home node.

6. On the home node, apply the diffs to the shared object.

To indicate to the home node which fields of the object have actually changed, a bitmask is sent along with the diffs. The size of the bitmask is the number of fields of the object. A bit set to 1 means that the field has changed and that the home node must apply the diff for this field.

The invalidation process is quite similar. Upon acquiring a lock, all non-home shared objects are invalidated. However, compared to the JMM, which states that the working memory of a thread is invalidated when the thread acquires a lock, there is an additional issue to be considered in the distributed environment. In our DSM we could have the case that a thread T1 acquires a lock L1 and performs some writes to non-home shared objects, and before that lock is released, a thread T2 acquires another lock L2. If T1 does not release L1 before T2 acquires L2, T2 would invalidate all non-home shared objects, causing T1 to fault the objects in again from their home nodes, and the writes of T1 would be lost. Therefore, before the invalidation triggered by acquiring a lock starts, all updates to non-home objects must be written back. This guarantees that T1 will see the changes it made recently when the invalidated non-home shared object is updated (see Figure 5.3). The operations triggered by an acquire on a shared object are thus the same as those triggered by the release operation, plus an additional invalidation of all non-home shared objects.

[Figure 5.3: Cache coherence protocol. Threads T1 and T2 run on node N1, which holds a cached copy of o1 whose home node is N2. T1 locks L1 and writes o1; after a thread switch T2 locks L2, which writes o1 back to N2 and invalidates the cached copy, then unlocks L2; after another thread switch T1 faults the invalid o1 in from N2, reads it and unlocks L1.]
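The word-by-word diff of step 3 and the bitmask sent in step 5 can be pictured with the following sketch, which models the object and its twin as plain int[] word arrays. This is an illustration of the idea only; the real implementation works on raw object memory, and all names are assumptions.

    // Illustrative diff computation between a cached copy and its twin object.
    // changedMask[i] == true means the home node must apply a diff to field i.
    final class DiffComputer {
        static int[] computeDiff(int[] object, int[] twin, boolean[] changedMask) {
            int changedCount = 0;
            for (int i = 0; i < object.length; i++) {
                if (object[i] != twin[i]) {
                    changedMask[i] = true;        // field i was written since the last flush
                    changedCount++;
                }
            }
            int[] diffs = new int[changedCount];  // only the changed words travel in the message
            int j = 0;
            for (int i = 0; i < object.length; i++) {
                if (changedMask[i]) {
                    diffs[j++] = object[i];
                }
            }
            return diffs;
        }

        // On the home node: apply the received diffs to the master copy.
        static void applyDiff(int[] homeObject, boolean[] changedMask, int[] diffs) {
            int j = 0;
            for (int i = 0; i < homeObject.length; i++) {
                if (changedMask[i]) {
                    homeObject[i] = diffs[j++];
                }
            }
        }
    }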

5.3.5 Statics

Since the JikesRVM is written in Java, the VM code needs an underlying runtime environment to be able to execute itself. The JikesRVM uses its own runtime environment, which is also shared with the application code. One data structure used by both is the so-called Java Table Of Contents (JTOC), a table of fixed size that contains all static variables of the VM and application code. In a VM running on a single machine, static variables can be considered global. In the DJVM, however, these static variables must be separated into global instances belonging to the application and local instances held within a cluster node, because the JikesRVM uses some static variables to maintain its runtime support structures such as type dictionaries, thread queues, GC maps, etc. These runtime support structures are node specific and must remain local to each node within the cluster. We perform this separation during classloading: since every loaded class is internally represented by a VM_Class object, we set a flag if the class belongs to the JikesRVM package. If a class implements an interface belonging to the JikesRVM package, its static fields are also considered node-local (in [36], an empty interface LocalOnlyStatic is used to tag all classes that are considered local-only).

Accesses to static fields differ from accesses to fields of object instances. VM-internally, every static field is represented by a VM_Field object, which contains an offset into the JTOC. Whenever a static field is accessed, the offset is pushed onto the stack, and together with the address of the JTOC, which is always held in a register by the Baseline compiler, the value can be read from or written to the corresponding JTOC slot. Because of our replication steps during classloading (cf. Section 4.5), we can ensure a consistent view of the JTOC, since static fields are inserted into the JTOC during classloading. We decided that globally accessible static fields are always accessed via the master node's JTOC, whereas static fields that are local per node are accessed via the local JTOC.

Due to this different access pattern for static fields, the faulting scheme must be adjusted, because the concept of detecting shared objects does not apply to static variables.

A previous idea was that every static field, represented by a VM_Field object, should also become a shared object with the master node as its home node. The problem was that some static fields might never be detected as shared. The two Listings 5.3 and 5.4 illustrate why our shared object concept does not carry over to static fields.

Listing 5.3: Simple classes with static fields - Working

    class Foo {
        public static int staticInt = 10;
    }

    class Bar {
        public static Foo foo = new Foo();
    }

When an instance of Bar becomes a shared object, the static field referring to the object of type Foo is also marked as shared. If the static field is later accessed via Bar.foo.staticInt, the object foo is faulted-in and the VM_Field representing staticInt is also marked as shared. In this case, every VM_Field representing a static field is marked as shared and must be accessed via the master node. Listing 5.4, however, shows an exception case. When an instance of type Bar is marked as shared, the VM_Field representing staticInt of Foo will not be shared, since Bar has no static fields declared. However, when the method returnFooStaticInt is executed, the static field of Foo must be considered a global static, but with our shared object detection scheme it will not be detected as such.

Listing 5.4: Simple classes with static fields - Not working

    class Foo {
        public static int staticInt = 10;
    }

    class Bar {
        public int returnFooStaticInt() {
            return Foo.staticInt;
        }
    }

As a consequence, we introduce a different access pattern for global statics. A worker node sends the offset into the JTOC to the master node. If a primitive type is accessed, the master node sends the value back to the worker node, or stores the value into its JTOC if it was a write operation. Reference types, however, are treated similarly to our faulting scheme. When a worker node wants to read a static reference, the master node checks whether the referenced object is locally available and marks the object as shared. The master node sends the GUID back to the

40 5. Implementation worker node which is then able to request the object from the corresponding home node. When a global static reference is written into the master node s JTOC, the worker node marks the referenced object as shared and then sends the GUID to the master node where the GUID address is stored into the corresponding JTOC slot by setting the last bit of the address to 1. Whenever a static reference is loaded from the JTOC, we must check if the object corresponding to this GUID is available locally or if it must be requested from its home node. 5.4 Distributed Classloader By using a centralized classloading approach on the master node and replicating the constructed type information to the worker nodes, we achieve a consistent view on all cluster nodes, i.e. if a VM_Class object with the class ID 1099 represents a loaded Java class Foo on one node, a VM_Class object with the same class ID will also represent the same Java class Foo on another node. This does also apply for the JTOC entries on each node. If for example the entry at offset 1000 refers to a static field of class Bar, we can be sure that the entry at the same offset on another node will refer to the same static field. As we have shown earlier in Figure 4.4, a VM_Class object is defined when a Java.class file is read. We have also mentioned that there are two classloaders in the JikesRVM, one is used during bootstrapping and one to load the application. When the application s classloader attempts to load an ordinary user s class file, it queries the bootstrap classloader to check if the class has already been loaded before trying the classloading itself. Since the bootstrap classloader only loads classes that are either contained in the bootimage or in the Java class library, these classes are locally available on each node because they are part of the VM image. Due to our SSI design, we require the application class files only to be located on the master node. As a consequence the distributed classloading is a bit different for the bootstrap classloader and for the application classloader: When the bootstrap classloader loads a class, the data are read from the local file. The definition of the byte stream into a VM_Class object must be done on the master node and replicated to all nodes. When the application classloader wants to load a class on a worker node, the reading process must be done on the master node because the application class file is located on the master node only. As a result, we have redirected the classloading for the bootstrap classloader in the method shown in Listing 5.5 whereas the forwarding for application classes is done in the VM_ApplicationClassLoader class presented in Listing 5.6. In particular, a message containing the class name to define or to load respectively is sent to the master node where the meta-objects are constructed and replicated to all other nodes.

Listing 5.5: Distributed Classloading Redirection for the Bootstrap Classloader

    class VM_ClassLoader {
        public static VM_Type defineClassInternal(String className,
                InputStream is, ClassLoader cl) throws ClassFormatError {
            if (DistributedClassLoader.redirectToMaster())
                return DistributedClassLoader
                        .distributedDefineClassInternal(className, cl);
            ...
        }
    }

Listing 5.6: Distributed Classloading Redirection for the Application Classloader

    class VM_ApplicationClassLoader extends URLClassLoader {
        protected Class<?> findClass(final String className)
                throws ClassNotFoundException {
            if (DistributedClassLoader.redirectToMaster()) {
                // send the class name to the master node for loading
                ...
            } else {
                return super.findClass(className);
            }
        }
    }

5.4.1 Class Replication Stages

Upon every state change of the VM_Class object (see Figure 4.4) on a worker node, a message is sent to the master node, which performs the state transition for the corresponding VM_Class object locally. After the state has changed, the master node announces the state transition to all worker nodes, which then perform the transition, too. In some stages, the master node needs to replicate type information to the worker nodes. In the following, we describe the most important type information replicated in each stage:

When a Java .class file is loaded into memory, the byte stream is analyzed, and a constant pool containing the names of all declared fields, methods, interfaces, super classes, sub classes, etc. is set up and processed. The corresponding meta-objects such as VM_Field, VM_Method, etc. - all having a unique ID - are created. We store these IDs in a CachedClass object that is later used for the replication of these types. Finally, when an instance of VM_Class has been created, the object is considered loaded. Replication is done by serializing each meta-object, i.e. the required information of each meta-object is fetched using the IDs contained in the CachedClass object and serialized into a byte stream that is eventually sent to all worker nodes. Each worker node deserializes the received information and reconstructs the meta-objects.

During resolution, all class members are linked, which involves inserting static fields and methods into the JTOC. As a result, we must distribute each offset in this table to the worker nodes.

Instantiation creates no JTOC entries, so this replication stage requires no additional information from the master node and can be done locally.

Initialization of local-only classes can be done locally, whereas global classes are initialized only on the master node. Therefore, the master node does not need to send any additional information to the worker nodes for this stage.

5.5 Distributed Scheduling

For our load balancing we use a library to obtain the Load Average from the operating system. A VM-internal LoadAverageThread is started on each cluster node and calls the library periodically (we set the interval to 5 seconds). If the LoadAverageThread is running on the master node, the value is stored directly into a double array whose length equals the total number of cluster nodes. If the thread is running on a worker node, the Load Average value is encapsulated in a message and sent to the master node, where the value is inserted into this array. To determine the most idle node in the cluster, the master node computes the minimum Load Average and returns the index that corresponds to the node ID of the most idle node.
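For illustration, the one-minute Load Average could also be obtained on Linux without a separate library by parsing /proc/loadavg, whose first three fields are the 1-, 5- and 15-minute averages. The thesis uses a library for this; the /proc-based variant below is only a sketch of the same idea.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Illustrative alternative to a native library: read the 1-minute Load Average
    // from /proc/loadavg (format: "0.42 0.37 0.30 1/123 4567").
    final class LoadAverageReader {
        static double oneMinuteLoadAverage() throws IOException {
            try (BufferedReader reader = new BufferedReader(new FileReader("/proc/loadavg"))) {
                String line = reader.readLine();
                return Double.parseDouble(line.split("\\s+")[0]);
            }
        }
    }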

5.5.1 Thread Distribution

In a Java application, threads are usually created and started with the instructions shown in Listing 5.7; the comments show the instructions used internally by the VM.

Listing 5.7: Creation and start of a thread in a Java application

    Thread t = new Thread(..);  // t = new <Thread.class>; t.<clinit>();
    t.start();                  // t.<invokemethod>;

Our thread distribution mechanism tries to improve access locality, as mentioned in Section 4.6. The decision on which cluster node the thread will run is made at allocation time of the thread. Consider the example presented in Figure 5.4. When a worker node N1 is about to create a thread object T1, it sends a Load Average request message to the master node to get the ID of a currently underloaded node. If the returned node ID differs from the ID of the cluster node allocating the thread, a thread allocation message is sent to the underloaded cluster node, in our case node N2, which reserves space for the thread object, initializes the object header and turns the object into a shared object. The GUID assigned to the thread object is returned to the requesting node N1, which is now considered the non-home node of T1. There, we allocate space for the twin object and execute the constructor. When thread T1 is started, node N1 sends a message containing the initialized data of T1 to N2, which copies the data into the object's body and finally starts the thread. N2 reports to the master node that thread T1 has started and sends an acknowledgement back to N1, which then maps the returned GUID to the object and invalidates the thread object T1. This guarantees that further accesses will cause a fault-in from the home node N2.

[Figure 5.4: Thread distribution. N1 asks the master node for the least loaded node (N2), requests allocation of T1 on N2 and receives T1's GUID; N1 allocates the twin object and runs the constructor, then sends the object data to N2, which initializes and starts T1 and reports this to the master node; after the acknowledgement, N1 maps T1 to its GUID and invalidates its local copy.]

5.5.2 Thread and VM Termination

During the boot process of the DJVM, several VM daemon threads are set up and started, such as the MessageDispatcher or the MonitorThread (a VM daemon thread involved in the remote locking process: it locks an object on behalf of a remote thread and sends a reply to the requesting node, cf. Section 5.6). VM daemon threads are only allowed to terminate once all threads of the application have terminated. Because application threads can be distributed among the cluster nodes, each node reports to the master node when a Java application thread terminates. When all Java threads have terminated, the VM daemon threads are allowed to terminate, shutting down the JVM. In detail, termination proceeds as follows:

1. After the boot process described in Section 5.2, the VM_MainThreads on the worker nodes wait on a node-local object called syncmainthread.

44 5. Implementation 2. Whenever an application thread is distributed and started on a worker node, a message is sent to the master node that increases the thread counter for that particular node, i.e. a user thread has been launched on that node. 3. Upon termination of an application thread on a worker node, a message is sent to the master node which decreases that counter for that node. 4. If the VM_MainThread of the master node is about to terminate, it checks if the counters of all worker nodes have reached the value 0, i.e. all threads from the application have terminated and the cluster is shut down. 5. The main thread of the master node sends a MessageSystemClusterExit request to all worker nodes. 6. When processing the SystemClusterExit message on the worker nodes, the MessageDispatcher thread notifies on the syncmainthread object such that the VM_MainThread is woken up. An acknowledgement that the worker node s VM is about to terminate is sent back to the master node. 7. If the master node has received all acknowledgements, it can safely terminate its VM. 5.6 Thread Synchronization As mentioned in Section 4.6.1, synchronization operations on shared objects are always executed on their respective home nodes. In our implementation (based on [17, 18, 24]), every node has a LockManager that handles all synchronization operations on either shared or non-shared objects. Because the synchronization requests lock and wait are blocking, the executing thread is suspended until the lock is acquired or until the thread is woken up, respectively, our MessageDispatcher thread receiving and processing all incoming messages of the cluster must not be blocked when handling a synchronization request. Therefore, upon receiving the synchronization message, the MessageDispatcher hands the request over to a so-called MonitorThread that takes over the processing and replying of the request. Since a MonitorThread performs the synchronization operation on behalf of the requesting thread located on a different node, the LockManager must ensure that each MonitorThread is assigned to the same remote thread, as long as the MonitorThread holds some locks. Only after having released all locks, the MonitorThread can process other synchronization requests for another remote thread. An example could be a remote thread T1 on node N1 that wants to lock a shared object whose home node is N2. T sends a lock request to the node N2 where the message is handled by a MonitorThread MT. MT locks the requested object and sends an acknowledgement back to the requesting node. After receiving the reply message, T can continue its execution. If later on T locks another object located on node N2, the LockManager on N2 must ensure that the same MonitorThread MT performs the synchronization. When thread T releases both locks, the

MonitorThread MT holds no locks anymore and is at this point allowed to perform synchronization operations for another remote thread.

[Figure 5.5: Acquire lock on a shared object. The application thread sends an acquire-lock request and waits; on the receiving node, the MessageDispatcher processes the request and arranges a MonitorThread, which acquires the lock and sends an acknowledgement back; the MessageDispatcher on the requesting node processes the response and wakes up the application thread, which then resumes.]

When the LockManager is booted during the boot process, we initially set up a pool of MonitorThreads. Because we cannot differentiate between VM and application objects, we rely on the fact that java.lang.Thread objects are only created by the application, since VM threads do not inherit from java.lang.Thread. When an application thread is started, it is assigned a GUID that is used for its synchronization requests. The LockManager takes a MonitorThread from its pool (if no MonitorThreads are left in the pool, a new instance is created) and associates this daemon thread with the GUID. In this way we guarantee that a remote thread is mapped to the same MonitorThread as long as that MonitorThread holds some locks. The MonitorThread object has an internal list that stores all currently locked objects. Only when this list becomes empty is the MonitorThread returned to the pool, so that it can be associated with another remote thread upon the next synchronization request.
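The association between a remote thread's GUID and its MonitorThread can be sketched as follows. The pool handling and all class names are assumptions made for illustration; in particular, the real MonitorThread is a VM daemon thread, which is not modelled here.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of the LockManager's mapping from remote-thread GUIDs to
    // MonitorThreads: a remote thread keeps its MonitorThread as long as it holds locks.
    final class LockManagerSketch {
        static final class MonitorThread {
            int heldLocks;  // number of objects currently locked on behalf of the remote thread
        }

        private final Deque<MonitorThread> pool = new ArrayDeque<>();
        private final Map<Long, MonitorThread> byRemoteThread = new HashMap<>();

        synchronized MonitorThread monitorFor(long remoteThreadGuid) {
            MonitorThread mt = byRemoteThread.get(remoteThreadGuid);
            if (mt == null) {
                mt = pool.isEmpty() ? new MonitorThread() : pool.pop();  // grow the pool on demand
                byRemoteThread.put(remoteThreadGuid, mt);
            }
            return mt;
        }

        synchronized void lockAcquired(long remoteThreadGuid) {
            monitorFor(remoteThreadGuid).heldLocks++;
        }

        synchronized void lockReleased(long remoteThreadGuid) {
            MonitorThread mt = byRemoteThread.get(remoteThreadGuid);
            if (mt != null && --mt.heldLocks == 0) {
                byRemoteThread.remove(remoteThreadGuid);  // only now may the MonitorThread serve others
                pool.push(mt);
            }
        }
    }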

5.7 I/O Redirection

A Java runtime environment needs to provide the application with an implementation of the Java class libraries. The JikesRVM uses the GNU Classpath package, which is an open-source implementation. Because GNU Classpath does not implement some VM-specific operations, the JikesRVM implements these in a glue layer which allows the library code to access the VM directly. The VMChannel class is responsible for all I/O operations. Since all classes in the GNU Classpath package call the methods of VMChannel, which in turn either uses some JikesRVM-specific implementations or performs a native system call to deal with I/O, I/O redirection for the DJVM can be done in the VMChannel class. For this purpose, before a VM-internal or system call for I/O is executed, we redirect the I/O operation to our IORedirector, where we check whether the I/O operation is executed on the master or on a worker node. In the former case, the I/O can be executed directly, whereas in the latter case the I/O operation must be forwarded to the master node together with the necessary parameters.

Listing 5.8: I/O redirection in VMChannel

    class VMChannel {
        public int read() {
            int result = IORedirector.redirectReadByte(fd);
            return result;
        }
    }

Listing 5.9: Pseudocode for I/O read redirection

    class IORedirector {
        public static int redirectReadByte(int fd) {
            if (isWorkerNode) {
                MessageIORead msg = new MessageIORead(fd);
                msg.send(masterNode);
                return msg.getBytes();
            } else { // master node
                return read(fd);
            }
        }
    }

In Listing 5.9 a sketch of an I/O read operation is given. The worker node constructs a read message of type MessageIORead and passes the file descriptor number as a parameter. The message is sent to the master node, where the I/O read operation on the given file descriptor is executed. The result is sent back to the worker node, which then returns the read result.

Since the runtime data structures of the JikesRVM are located in two .jar files, we allow the worker nodes to read from a local copy of these two libraries for performance reasons. Internally, a file descriptor is represented by a State object which contains the native file descriptor number. We added a flag to this object to indicate that a file descriptor is local. When one of the files jksvm.jar or rvmrt.jar is opened, we set this flag to true such that further accesses can be done locally instead of going through the master node (a sketch of this check is given at the end of this section). Currently, we have implemented I/O redirection for operations on files but have left it out for sockets since they were not relevant for our benchmarks.

Note that the VM provides methods such as sysWrite() that in turn perform a system call directly. However, some VM-internal classes execute System.out.println() for debugging output. In the context of the DJVM, the latter case would result in I/O redirection, whereas in the former case the output is printed on the local node. Since this only applies to classes that are not involved in the DJVM, we did not adjust them.
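To illustrate the local-file optimization just described, the redirection check might look like the following sketch; the State class, the isLocal flag handling and all helper methods are simplified stand-ins for our implementation, not actual JikesRVM or GNU Classpath code.

    class IORedirectorSketch {
        static boolean isWorkerNode = true;          // illustrative configuration flag

        // Simplified stand-in for the State object that wraps a native file descriptor.
        static class State {
            final int fd;
            boolean isLocal;                         // set when jksvm.jar or rvmrt.jar is opened
            State(int fd) { this.fd = fd; }
        }

        static int redirectReadByte(State state) {
            if (isWorkerNode && !state.isLocal) {
                // forward the read to the master node (message handling elided)
                return readOnMasterNode(state.fd);
            }
            // master node, or a descriptor marked as local: perform the read directly
            return readLocally(state.fd);
        }

        static int readOnMasterNode(int fd) { /* send a MessageIORead, wait for the reply */ return -1; }
        static int readLocally(int fd)      { /* native read() on the file descriptor */ return -1; }
    }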

6 Benchmarks

In this chapter we measure our Shared Object Space's overhead that results from the mechanisms we added to our distributed runtime system, such as remote thread allocation, classloading, synchronization and I/O redirection.

6.1 Test Environment

We conducted our performance evaluation on a small cluster consisting of two nodes. The master node runs on an Intel Core Duo 3GHz workstation with 2GB memory. The operating system is Ubuntu 8.04 with the generic Linux kernel 2.6.24. A worker node was configured on an Intel Core Duo 2.5GHz laptop with 3GB memory, also running Ubuntu 8.04 with the Linux kernel 2.6.24. Both nodes are interconnected by 1 Gbit Ethernet network cards and are located in the same subnet of the local area network. Since we optimized the TCP/IP communication in our implementation, i.e. message acknowledgements containing no data can be omitted because of the reliability of TCP 1, we conducted the benchmarks of our distributed JVM based on TCP sockets. An image of the DJVM that has been compiled with the BaseBaseNoGC 2 configuration has been installed on both nodes. The Java application is located on the master node only and is started there.

6.2 Test Application Suite

Due to the lack of garbage collector support, benchmark suites such as DaCapo 3 or SpecJBB 4 resulted in an out-of-memory exception because these application suites allocate so many objects that the maximum heap size of 800MB of the JikesRVM is exceeded. We therefore decided to use Java Grande [2]. Unfortunately, we experienced problems when using this benchmark suite. We strongly believe that this has to do with the synchronization used there, which is based on volatile fields 5. In our opinion the specification of volatile fields is somewhat blurred.

1 When using raw sockets, these acknowledgements must be sent back to let the sender know that the message has been received.
2 The Baseline compiler is used for the bootimage writing and during runtime of the VM, and no garbage collector is supported.
3 http://www.dacapobench.org/
4 http://www.spec.org/jbb2000/
5 Benchmark applications such as the SimpleBarrier that use the keyword synchronized for synchronization run without any problems.

The JMM [21] states that operations on the master copies of volatile variables on behalf of a thread are performed by the main memory in exactly the order that the thread requested, meaning that the keyword volatile guarantees that all memory loads and stores happen in the order in which they are specified. In the Java Grande benchmark suite, threads synchronize on a barrier containing a boolean array member declared as volatile. It is implied that a running thread always loads a volatile field from main memory and immediately writes it back to memory upon a store. But this is not exactly what the JMM actually specifies. In [4], the current JMM is being redefined and it is proposed that the semantics of volatile variables should be strengthened to have acquire and release semantics: a read of a volatile field has acquire semantics, whereas a write to a volatile field has release semantics.

The Baseline compiler in the JikesRVM does not treat volatile fields differently from normal fields, because the Baseline compiler translates each bytecode into a series of instructions on the host machine. The order of the instructions and the memory access pattern are therefore maintained. However, the acquire and release semantics are not applied. In order to run the Java Grande benchmark suite correctly in the context of our distributed JVM, we assume that additional software checks for volatile fields must be inserted (a sketch of such checks is given after the following list). Upon a load of a volatile field whose declaring object is not on its home node, the value must be fetched from the home node. Furthermore, a store to a volatile field must trigger a writeback to its home node.

Because we have not implemented the volatile field checks, we finally decided to evaluate the performance with some custom benchmark applications that show the overhead of our distributed runtime system. In particular, we measured the overhead for:

- Accessing several objects that are
  - node-local
  - shared and located on their home node
  - shared, located on a non-home node, and not yet requested
  - shared, located on a non-home node, and already cached
- Allocating and starting a remote thread
- Loading classes
  - on the master node
  - on a worker node
- Synchronization on objects
- I/O redirection from a worker node to the master node
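A minimal sketch of the volatile field checks proposed above could look as follows; the barrier and helper names are purely illustrative, and none of this exists in our implementation.

    class VolatileBarrierSketch {
        // Would be inserted by the compiler before every read of a volatile field.
        static void volatileReadBarrier(Object obj) {
            if (isShared(obj) && !isOnHomeNode(obj)) {
                fetchFromHomeNode(obj);        // acquire semantics: get the latest value first
            }
        }

        // Would be inserted by the compiler after every write to a volatile field.
        static void volatileWriteBarrier(Object obj) {
            if (isShared(obj) && !isOnHomeNode(obj)) {
                writeBackToHomeNode(obj);      // release semantics: publish the new value
            }
        }

        static boolean isShared(Object obj)         { return false; /* check the shared bit in the object header */ }
        static boolean isOnHomeNode(Object obj)     { return true;  /* compare the GUID's home node ID with this node */ }
        static void fetchFromHomeNode(Object obj)   { /* request the object data from its home node */ }
        static void writeBackToHomeNode(Object obj) { /* propagate the written value to the home node */ }
    }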

6.3 Performance Evaluation

First it should be mentioned that, due to the lack of proper debugger support, we used system calls to print debugging output onto the node's screen during development. These debugging outputs are declared within a condition if (currentLevel <= DEBUGLEVEL). For the performance evaluation we have set the constant DEBUGLEVEL to 0 so that no debugging information is printed. However, since the Baseline compiler does not perform optimizations such as dead-code removal, we have the overhead of a compare instruction for every debug instruction inserted in the code. By removing all debug statements, the performance should become slightly better.

In our performance test, we measured the overhead for accessing different objects as shown in Listing 6.1.

Listing 6.1: Worker thread accessing objects

    public void run() {
        long[] elapsedTime = new long[2];
        for (int j = 0; j < elapsedTime.length; j++) {
            long start = System.nanoTime();
            for (int i = 0; i < objs.length; i++) {
                ++objs[i].id;
            }
            long end = System.nanoTime();
            elapsedTime[j] = end - start;
        }
        for (int j = 0; j < elapsedTime.length; j++) {
            System.out.println(elapsedTime[j]);
        }
    }

The first row in Table 6.1 shows the time of the unmodified JikesRVM for accessing 100, 1000 and 10000 objects, respectively. The access time for node-local objects in our DJVM is higher because of the additional software checks we added to the compiler: for node-local objects, we have a load instruction for the status header, a compare instruction that checks whether the object is shared and finally a jump instruction, which together are responsible for the overhead (a sketch is given below). Accesses to shared objects require an additional check of the invalid state (see Section 4.4.4) and are therefore slower than accesses to node-local objects. The faulting-in of shared objects is a costly operation that needs about 2 ms per object 6. Once a shared object is cached, the access time is similar to that of shared objects located on their home node.

We compared the execution time for allocating and starting a thread on the master node itself against the distribution of a thread to a remote node. The overhead is a result of the remote allocation and the remote initialization messages being sent to the remote node (see Section 5.5.1).

6 A request message is sent to the home node where the object data is copied into the response message. After receiving the reply, the requesting node allocates space for the object and deserializes the data.
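The software check mentioned above, expressed at the Java level rather than as the emitted machine instructions, might look like the following sketch; the status-word layout and the helper names are assumptions for illustration only.

    class AccessCheckSketch {
        static final int SHARED_BIT  = 0x1;
        static final int INVALID_BIT = 0x2;

        // Conceptually executed before a field access compiled by the Baseline compiler.
        static void beforeFieldAccess(Object obj) {
            int status = statusWord(obj);             // load of the status header
            if ((status & SHARED_BIT) == 0) {
                return;                               // node-local object: no further work
            }
            if ((status & INVALID_BIT) != 0) {
                faultInFromHomeNode(obj);             // cached copy is invalid or missing
            }
        }

        static int  statusWord(Object obj)          { return 0; /* read the extra status word of the header */ }
        static void faultInFromHomeNode(Object obj) { /* request the object data from its home node */ }
    }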

                                     100 objects    1000 objects    10000 objects
    original                         2.12 µs        15.93 µs        151.66 µs
    node-local                       7.03 µs        52.1 µs         554.51 µs
    shared, home node                7.45 µs        66.32 µs        584.79 µs
    shared, non-home node (miss)     260 ms         2379 ms         23313 ms
    shared, non-home node (cached)   8.45 µs        75.85 µs        701.79 µs

Table 6.1: Access on node-local and shared objects.

In Table 6.2, we also show the time required for classloading on the master node and the time when a worker forwards the classloading to the master node. Finally, the last row in the table presents the execution time for a simple System.out.println() on the master node and on the worker node, which must copy the byte stream into a message and send it to the master node.

                         Master node    Worker node
    Thread allocation    0.46 ms        155.8 ms
    Classloading         3.38 ms        15 ms
    I/O redirection      0.11 ms        12.26 ms

Table 6.2: Overhead of thread allocation, classloading and I/O redirection.

In the last performance test we ran a simple producer-consumer application. A writer thread synchronizes on an object that is located on the master node and writes into a field. Then it sets a flag and notifies a reader thread, which reads this value and signals the writer to write new data. Listing 6.2 shows the run() method of the reader thread; a sketch of the corresponding writer thread is given after Figure 6.1.

Listing 6.2: Reader thread

    public void run() {
        int[] values = new int[1000];
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1000; i++) {
            synchronized (obj) {
                while (!obj.isReadable()) {
                    try {
                        obj.wait();
                    } catch (InterruptedException e) {}
                }
                values[i] = obj.getCounter();
                obj.setWritable();
                obj.notify();
            }
        }
        long end = System.currentTimeMillis();
        long elapsed = end - start;
        System.out.println(elapsed);
    }

We modified our scheduling load balancing function to explicitly allocate a certain thread on a particular node so that we could test four different cases:

1. The reader and the writer thread are both located on the master node.

2. The reader thread is located on the master node, the writer thread resides on the worker node.

3. The reader thread is allocated on the worker node, the writer thread remains on the master node.

4. Both threads are allocated and started on the worker node.

Note that the synchronization object is located on the master node in all cases. Figure 6.1 shows the result of all four test cases. If both threads are allocated on the same node as the synchronization object, the execution time is less than 20 ms. However, when one thread is located on a worker node, upon each acquire operation the synchronization object must be invalidated and updated before the next access. It becomes even more expensive if both threads are located on the worker node, which results in twice the amount of updates and diff propagations. This example shows that a cache flush in the DSM is much more expensive than its local counterpart.

Figure 6.1: Thread synchronization time.
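For completeness, a sketch of the writer (producer) thread used together with Listing 6.2 is given below; the accessor names mirror those used by the reader and are assumptions about the shared benchmark object, not a verbatim reproduction of our benchmark code.

    public void run() {
        for (int i = 0; i < 1000; i++) {
            synchronized (obj) {
                while (!obj.isWritable()) {      // wait until the reader has consumed the value
                    try {
                        obj.wait();
                    } catch (InterruptedException e) {}
                }
                obj.setCounter(i);               // write the new data into the field
                obj.setReadable();               // flip the flag and wake up the reader
                obj.notify();
            }
        }
    }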

7 Conclusions and Future Work

In this last chapter we discuss the problems we encountered during our implementation of the distributed JVM. We also discuss several optimizations to our Shared Object Space to improve performance. We list some further work and give a conclusion about our system at the end.

7.1 Problems

As we have mentioned in Section 5.3.5, the JikesRVM uses its compilers to generate native machine code from the application bytecode as well as from the VM code, and uses its own runtime system to execute itself. On the one hand, significant performance improvements can be achieved through the close integration of the Java application and the VM object space. On the other hand, the boundary between application and VM objects is blurred, which results in difficulties, especially concerning distributed shared memory. To understand the problem more clearly, we first explain how the application and the VM actually interact with each other.

We have seen in Section 5.7 that the JikesRVM runtime system provides the Java application with the GNU Classpath Java library. When an application object is created, the code from Classpath is executed. To interact with the VM, some library objects are used as adapters to VM objects. Consider Figure 7.1: if a Java application creates a java.lang.Thread object, a corresponding adapter object java.lang.VMThread is created. When the java.lang.Thread is finally started, it invokes a method in java.lang.VMThread that in turn performs a call on the VM object org.jikesrvm.scheduler.VM_Thread.

Figure 7.1: Thread representation (java.lang.Thread references its adapter java.lang.VMThread, which in turn references the internal org.jikesrvm.scheduler.VM_Thread).
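A strongly simplified sketch of the delegation shown in Figure 7.1 is given below; it paraphrases the GNU Classpath and JikesRVM classes rather than reproducing them, and the method bodies are placeholders.

    public class Thread {                        // stands in for java.lang.Thread (application-visible)
        private final VMThread vmThread = new VMThread(this);
        public void start() { vmThread.start(); }
    }

    class VMThread {                             // stands in for java.lang.VMThread (Classpath adapter)
        private final Thread peer;               // the application-level thread object
        private final VM_Thread vmdata = new VM_Thread();
        VMThread(Thread peer) { this.peer = peer; }
        void start() { vmdata.startInternal(); }
    }

    class VM_Thread {                            // stands in for org.jikesrvm.scheduler.VM_Thread (VM object)
        void startInternal() { /* enqueue the thread on the node-local scheduler queue */ }
    }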

The difficulties due to the lack of a clear VM and application object separation are quite obvious. Some VM objects, such as org.jikesrvm.scheduler.VM_Thread, should always be considered local because they are used in the cluster node's local thread queue for scheduling; such an object should not become a shared object. However, if the application performs a call to a method defined in java.lang.Thread that is a non-home shared object, the corresponding method in the VM object will be invoked, resulting in a fault-in of the internal org.jikesrvm.scheduler.VM_Thread object from the home node. Problems arise for method calls such as join() defined in java.lang.Thread. By executing this method, the calling thread waits until it is woken up when the other thread has terminated. However, in our example the calling thread joins on a cached copy of the VM object that is only scheduled on its home node. The calling thread is never woken up, since the home node does not know that it has to wake up the joining thread on the other node.

This issue has a negative impact on the performance of our system: if a local VM object becomes a shared object, further synchronization operations on it result in unnecessary invalidations of the other cached non-home shared objects, so that they have to be updated. The cache flush problem also appears when a static synchronized method that belongs to the Java class library is called. The synchronization is done on the single instance of type Class. Because we defined all Class objects to be shared, this results in a cache flush of all non-home shared objects even when the static synchronized method was called within the VM 1.

Fortunately, the internal org.jikesrvm.scheduler.VM_Thread object does not inherit from java.lang.Thread, whose instances are only started by the Java application 2. Therefore, we rely on the fact that every started java.lang.Thread object must be an application object. By introducing several exception cases based on the package name of the class, we try to prevent VM objects from becoming shared objects. This approach is not clean, because a java.lang.Integer object could be used either by the VM or by the application. Therefore, a real separation is encouraged.

The issue of separating the heap for VM objects is known and several ideas have been discussed 3, but no real implementation has been done so far. In [29] the approach is based on name space separation by the classloader. In particular, they make use of two initial classloaders: the bootstrap classloader loads the initial bootstrap code and the Java class libraries for the application's use, and the VM classloader is used for loading the VM classes and all classes used by the VM, including the Java class library. By preventing the VM from passing the VM classloader object to the application code, the application cannot access any of the VM classes, so that the required isolation is achieved. By introducing an interface layer called the Mu 4 -layer that is located between the VM and the application, the application is able to interact with the VM through so-called Mu-objects. It should be mentioned that JavaSplit pursues a similar approach: since the added runtime logic is also written in Java, they have similar issues with separating VM and application objects. In JavaSplit, the whole class hierarchy is instrumented and then replicated in a subspace of the class name space by adding the javasplit prefix [16]. The application uses this replica transparently, while the instrumentation code and the DSM operate on the original class hierarchy.

1 E.g. if one would use System.out.println(String s) within the VM code, this would lead to a cache flush since a static synchronized method is executed in the call chain.
2 Inside the VM, java.lang.Thread could of course be used, since the corresponding VM objects are created automatically.
3 http://jira.codehaus.org/browse/rvm-399
4 Coming from the Greek letter µ.

As the latest version of the djvm 5 (see Section 3.3.4) also uses bytecode rewriting techniques, we assume that they instrument the application code similarly.

Since the JikesRVM only supports partial debugging with the GNU debugger (GDB) by inspecting object methods within the bootimage, debugging methods that are compiled at runtime was difficult. Especially in our distributed and multi-threaded environment, a proper debugging environment is desirable. A Google Summer of Code 2008 project 6 has been announced to implement Java Debug Wire Protocol and Java Virtual Machine Tool Interface support.

7.2 Future Work

In this section we discuss future work that could be implemented to extend our prototype of a distributed JVM. We propose several optimizations, some of which are directly related to our Shared Object Space: they utilize the runtime information available to our object-based DSM embedded in the VM to improve performance. One example of such an optimization is the Object Home Migration concept.

7.2.1 Object Home Migration

The idea behind Object Home Migration is to migrate a shared object from its initial home to another node in the cluster. If a DSM application uses a single-writer access pattern, i.e. the shared object is only updated by one thread for a certain time interval, it makes sense to declare the node where that thread is located as the home node of the shared object. Upon acquire and release operations on shared objects that have migrated their home, the invalidation process according to the HLRC protocol is not required anymore. Thus, the communication overhead for the necessary update of the shared data and for the propagation of the diffs is avoided. Also, the diffs do not need to be calculated and applied, which saves the additional memory access overhead.

Because our Shared Object Space is embedded in the VM, the runtime system knows about every request of shared objects and their writebacks to the corresponding home nodes. A single-writer access pattern can therefore be detected, since object requests, i.e. the requesting node faults the shared object in, can be considered as read operations, and diff propagations are regarded as write operations. In [18, 33] a threshold concept is proposed that decides when a home migration should take place. The detection of a single-writer access pattern is done on the home node, which increases a counter for diff propagations from a certain cluster node. To avoid additional bookkeeping overhead, only consecutive writes by that particular cluster node are considered. If the counter exceeds a certain predefined threshold, the DSM knows that the last write operations came from the same node, and thus that cluster node should become the new home node of the shared object. Interleaving diff propagations from other nodes (including the home node itself) cause the counter to be reset. A sketch of this detection is given below.

5 The djvm developed at the Australian National University in Canberra should not be mistaken for our DJVM.
6 http://jira.codehaus.org/browse/rvm-33
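The following is a minimal sketch of such a threshold-based single-writer detection, kept per shared object on its current home node; the threshold value and all names are assumptions for illustration.

    class HomeMigrationDetectorSketch {
        static final int MIGRATION_THRESHOLD = 8;    // assumed value; would be tuned experimentally

        private int lastWriterNode = -1;             // node that sent the previous diff
        private int consecutiveWrites = 0;

        // Called on the home node whenever a diff for this object is applied.
        // Returns the node the home should migrate to, or -1 to keep the current home.
        int onDiffReceived(int writerNode) {
            if (writerNode == lastWriterNode) {
                consecutiveWrites++;
            } else {                                 // interleaving writer: start counting again
                lastWriterNode = writerNode;
                consecutiveWrites = 1;
            }
            // A write performed by the home node itself would reset the counter as well (not shown).
            if (consecutiveWrites >= MIGRATION_THRESHOLD) {
                return writerNode;                   // single-writer pattern detected
            }
            return -1;
        }
    }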

The home migration notification mechanism should be designed carefully. A shared object being migrated to another node must be considered invalid on the current home node. In the meantime, however, a node could fault that shared object in from its obsolete home. Therefore, a proper mechanism for notifying the other nodes of the new home node of that shared object is needed. A simple way is to send a broadcast message to all other nodes after the object has been migrated. If a fault-in request arrives during home migration, i.e. before the broadcast message is sent, the requesting node must be blocked until the new home is broadcast; then the fault-in can be repeated by contacting the new home node. Note that threads located on the obsolete home node must also wait until the migration has completed successfully. Concerning our GUID concept, which contains the ID of the home node, the GUID must be changed so that further fault-ins will result in contacting the new home node.

Another mechanism proposed in [17] uses forwarding pointers that are installed on the former home node and point to the new home node. Upon a home miss, the requesting node is redirected to the new home via the given forwarding pointer. The advantage is that the thread on the requesting node does not have to wait. However, drawbacks exist when an object has moved its home several times, resulting in a chain of forwarding pointers which leads to several levels of indirection before reaching the new home. With this mechanism the GUID of the object does not need to be changed, because the forwarding pointer is responsible for the indirection. Since both mechanisms incur overhead, the choice of the home migration notification depends on the memory access pattern on the global object space: if many nodes need to access an object on the new home, broadcasting should be used to inform all nodes about the new home, whereas a forwarding pointer could be beneficial if only a small number of nodes requires access to that object, avoiding the overhead of broadcasting.

As discussed, Object Home Migration is an expensive mechanism. Therefore, it is reasonable to migrate an object's home only when a single-writer access pattern is detected. If home migration were performed under a multiple-writer access pattern, this could lead to a large number of home movements.

7.2.2 Object Prefetching

Another optimization is based on object connectivity and the available information about requested objects in our Shared Object Space. The idea is to prefetch multiple objects when a shared object is faulted in. The prefetching can be done since the object's connectivity graph is known at runtime. Consider a shared object o1 that has a reference to another shared object o2, both having their home on the same node. If a requesting node faults in object o1 and later accesses o2, which leads to another fault-in, the DSM can keep track of these particular requests. If further requests from other nodes for object o1 arrive, the home node utilizes the object connectivity and the previous faulting requests and sends object o1 together with o2 to the requesting node, which avoids the communication overhead of a second fault-in for o2.
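A sketch of such connectivity-based prefetching on the home node, bounded by a message size as in [18], might look as follows; the size constant and the helper methods are assumptions, not our implementation.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class PrefetchSketch {
        static final int MAX_MESSAGE_BYTES = 4096;   // assumed "optimal" message size

        // Collects the requested object plus reachable shared objects until the budget is reached.
        List<Object> objectsToSend(Object requested) {
            List<Object> reply = new ArrayList<Object>();
            Deque<Object> worklist = new ArrayDeque<Object>();
            Set<Object> visited = new HashSet<Object>();
            worklist.add(requested);
            visited.add(requested);
            int bytes = 0;
            while (!worklist.isEmpty()) {
                Object o = worklist.poll();
                int size = serializedSize(o);
                if (!reply.isEmpty() && bytes + size > MAX_MESSAGE_BYTES) {
                    break;                           // stop before exceeding the message budget
                }
                reply.add(o);
                bytes += size;
                for (Object ref : sharedReferencesOf(o)) {   // follow the connectivity graph
                    if (visited.add(ref)) {
                        worklist.add(ref);
                    }
                }
            }
            return reply;
        }

        int serializedSize(Object o)              { return 64; /* placeholder */ }
        List<Object> sharedReferencesOf(Object o) { return new ArrayList<Object>(); /* placeholder */ }
    }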

In [18] an optimal message size is defined: if an object is requested, all reachable objects (with the requested object as their root) are copied into the message until it reaches the predefined size, thus reducing the negative impact of prefetching large objects such as arrays of reference types.

7.2.3 Thread Migration

Our scheduling of threads is based on a static load balancing mechanism that distributes a Java application thread to the most underloaded node before the thread is started (cf. Section 4.6). In other distributed JVM systems such as [24, 35], a transparent thread migration mechanism has been implemented to achieve dynamic load balancing, which is more accurate than the static approach we use. The idea behind thread migration is to relocate a running thread to an idle node to reduce the CPU utilization on the node the thread was previously running on. First, the running thread needs to be stopped so that its state can be captured. This state is transferred to another node, where a new thread is created and initialized with the transferred state before the thread is finally started, i.e. it resumes its execution.

If we consider thread migration in our DJVM context, we have to analyze how the various runtime data areas, such as the object heap, the JVM stacks and the method areas 7 (defined in [21]), that together form the execution context of a Java thread, are handled. References from the thread object into the object heap are easy to handle by sharing the referenced objects, i.e. moving them into the Shared Object Space. During migration, only the GUIDs of these objects need to be transferred, since our DSM makes sure that further accesses to these objects fault them in correctly. With our centralized classloading approach, each node has a consistent view: a migrated thread will therefore have the same TIB, which contains pointers to the same methods and to the type information. The problem to be solved is the extraction and restoration of the migrated threads. A thread migration framework based on an older version of the JikesRVM 8 is described in [25, 26]. In these approaches the stack is captured in frames and the original call stack up to the next instruction to be executed is re-established, but the techniques used are beyond the scope of this thesis.

7.2.4 Fast Object Transfer Mechanism

Our communication model is designed to support communication over Ethernet frames and TCP. Since raw sockets require a byte array as data input, each message is converted into a packet of bytes before being passed to the socket. When a shared object is requested by a node, the home node copies the object into a byte buffer, adds the protocol header (cf. Section 4.3.1) and sends the resulting packet over the socket. The JikesRVM allows direct memory access through the methods of the VM_Magic class; a call to these methods results in assembly instructions being emitted by the compiler that access certain parts of memory. Since TCP sockets are stream-oriented, this can be combined with the direct memory access to achieve a fast object transfer mechanism, as shown in [14].

7 The method area includes all classes loaded by the JVM and their compiled methods.
8 JikesRVM version 2.4.6.

Instead of copying the object into a byte buffer, the object can be sent directly from memory as a stream, which avoids the copying overhead per requested object. To keep the messaging model consistent between the raw socket and TCP socket communication (cf. Section 5.1), the fast object transfer mechanism has not been implemented.

7.2.5 VM and Application Object Separation

As we have discussed thoroughly in Section 7.1, the lack of a clear boundary between VM and application objects results in many problems for the DJVM. VM objects can get shared unintentionally, so that further synchronization operations on them trigger the cache coherence protocol, which is quite expensive and thus reduces performance significantly. In the mentioned section we have shown several approaches to achieve a proper separation of VM and application objects.

7.2.6 Garbage Collector Component

Due to the Shared Object Space, a local garbage collector will not suffice, since a shared object on its home node cannot be reclaimed if there may be a remote reference to it on another node. Therefore, a garbage collector must be extended by a component that detects such remote references and moves objects out of the Shared Object Space if they are only reachable by local threads. Node-local objects can then be processed by the local garbage collector alone. In Section 4.8 we have outlined an algorithm for a component that works with any garbage collector and that should be implemented before executing the DJVM in a production environment.

7.2.7 I/O Redirection for Sockets

As we have mentioned in Section 5.7, we have left out the implementation for forwarding I/O operations on socket file descriptors. To achieve a complete SSI, the corresponding operations should be forwarded to the master node, which handles all I/O. Since the VMChannel class also contains the code for socket operations, the corresponding redirection code can be inserted there.

7.2.8 Volatile Field Checks

As described in Section 6.2, checks similar to those for acquire and release should be added for volatile fields so that synchronization via volatile fields works correctly.

7.2.9 Trap Handler Detection of Faulting Accesses

Instead of using software checks added to the compiler, the trap handler of the JikesRVM could be utilized for detecting faulting accesses.

7.2.10 Software Checks in the Optimizing Compiler

Due to the complexity of the optimizing compiler of the JikesRVM, we added the software checks to the Baseline compiler. Since the optimizing compiler produces high-quality code, which improves performance significantly, it is recommended to add these checks to the optimizing compiler, too. In [17] they showed for their cluster-based JVM that the JIT compilation mode performs much better than the interpretation mode.

7.3 Conclusions

In our work we have modified the JikesRVM to build a prototype of a distributed Java Virtual Machine. Rather than running the VM on top of page-based DSM software that is responsible for object allocation and manages the cache coherence issues, we decided to develop our Shared Object Space inside the JikesRVM. Our object-based DSM is able to use the abundant runtime information of the JVM, which opens the opportunity for further optimizations. With the implementation of the lazy detection scheme for shared objects, we only move objects into the Shared Object Space if they are reachable from at least two threads located on two different nodes. We added a faulting scheme so that a copy of a shared object can be requested from its home node if the object is not locally available. Our cache coherence protocol handles the synchronization on objects correctly as specified in the JMM: upon an acquire operation, the protocol guarantees that the latest data of a shared object is fetched; a release operation triggers a writeback of all previously written objects to their home nodes. By separating objects into shared and node-local objects, we only have to treat shared objects specially, since node-local objects can, for example, be reclaimed by the local garbage collector. Furthermore, synchronization on node-local objects does not trigger a distributed cache flush, which is quite expensive as shown in Section 6.3. We have also added mechanisms for distributed classloading by using a centralized classloader that replicates the class definitions to all worker nodes. We implemented a distributed scheduler that includes a load balancing function to help decide where to allocate and start a Java application thread by choosing the most underloaded node. Additionally, by redirecting all I/O operations to the master node, we hide the underlying distribution from the programmer and from the Java application itself and gain full transparency. Since we do not introduce any additional Java language constructs, we achieve a complete SSI.

We have shown in our benchmarks chapter that our Shared Object Space works correctly. Shared objects are faulted in when they are needed, and the lazy detection scheme makes sure that objects that are reachable from threads located on different nodes become shared objects. The distributed classloading makes sure that each node has a consistent view of the class definitions. Our cache coherence protocol deals with synchronization correctly as long as monitors are used. However, to run our DJVM in a production environment, a garbage collector component needs to be implemented to deal with shared objects. Furthermore, for performance reasons, as seen in the synchronization benchmarks, it is advisable to develop a mechanism for migrating an object's

home node. As mentioned in Section 7.2.8, additional checks for volatile fields need to be inserted in the compiler in order to completely cover the thread synchronization used in Java. Furthermore, the java.lang.Thread wrapper class needs to be adjusted to delegate all method calls correctly to its corresponding VM-internal representation; for the benchmarks we have only rewritten the join() and start() methods. As far as classloading is concerned, we left out annotation support, since the meta-objects created for annotations are treated differently. We also did not consider the case where the application creates its own classloader. In this case the classloader object should be allocated on the master node, and a proxy that performs the redirection should be generated on the worker nodes.

We experienced several problems during the development of the DJVM. First, we encountered problems regarding deadlocks. At some point we had to block the execution of a thread until a reply message arrives, e.g. when a worker node redirects the classloading to the master node, the requesting thread must wait until all class definitions arrive. Since the requesting thread could have acquired a lock before, this resulted in a classical deadlock situation. Due to the lack of proper debugger support it was difficult to detect the exact place where a deadlock happened. We also had problems with the VM and application object separation, as mentioned in Section 7.1. We had to introduce exception cases so that VM objects do not become shared objects. Because several static methods in the Java class library are synchronized, this results in a costly cache flush of non-home shared objects even if the method was called inside the VM code. These problems can be solved by defining a clear boundary between VM and application objects.

A Appendix

A.1 DJVM Usage

This section describes how to run the DJVM and lists the additional command line options we introduced for our distributed runtime system. Note that one machine in the cluster should be configured as the master node and must be started first. The other cluster nodes should be configured as worker nodes.

-X:nodeId=x: The node's ID within the cluster. x must be 0 on the master node.

-X:totalNodes=x: The total number of nodes within the cluster. Currently, we support up to four cluster nodes.

-X:nodeAddress=x: The IP or MAC address of the node.

-X:useTCP=x: If TCP communication should be used, x should be set to true.

-X:nodeListeningPort=x: The listening port where the other nodes can connect to establish a communication channel. x should have the same value on all nodes.

-X:masterNodeAddress=x: The IP or MAC address of the master node. This command line option is only needed on the worker nodes.

A sample configuration to run a machine as the master node using TCP communication within a cluster consisting of two nodes could look like this:

    rvm -X:nodeId=0 -X:totalNodes=2 -X:nodeAddress=129.132.50.8 -X:useTCP=true -X:nodeListeningPort=60000

The worker node could be started with the following arguments:

    rvm -X:nodeId=1 -X:totalNodes=2 -X:nodeAddress=129.132.50.9 -X:useTCP=true -X:nodeListeningPort=60000 -X:masterNodeAddress=129.132.50.8

Note that the use of raw sockets requires root privileges.

A.2 DJVM Classes

In this section we give a short description of the most important classes used by the DJVM.

A.2.1 CommManager

The CommManager is the first component of the DJVM that is booted. It sets up the cluster by establishing a connection to all cluster nodes and by starting the MessageDispatcher thread, which runs in a loop to receive further messages. The CommManager also provides methods for sending and receiving messages over TCP or raw sockets. A pool of MessageBuffers is created when the CommManager is booted; a MessageBuffer can be obtained by calling the method getMessageBuffer() and should be returned with returnMessageBuffer().

A.2.2 Message

Figure A.1 shows the message class hierarchy of our message model. All shown classes are abstract. Concrete instances inherit from one of the abstract classes, e.g. a message that deals with classloading inherits from MessageClassLoading, whereas synchronization messages inherit from MessageObjectLock, etc. The methods that must be implemented by the subclasses are described in Section 5.1.

Figure A.1: Message class hierarchy (Message with its abstract subclasses MessageClassLoading, MessageObjectAccess, MessageObjectLock, MessageScheduling, MessageStatics, MessageSystem and MessageIO).

A.2.3 SharedObjectManager

The SharedObjectManager is responsible for all operations executed on shared objects. It provides methods for sharing an object, faulting an object in, computing and applying diffs, invalidating all non-home shared objects, updating invalid shared objects, etc.
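As a rough illustration of the responsibilities listed above, the SharedObjectManager's interface might be summarized as in the following sketch; the signatures are assumptions and do not reproduce the actual DJVM method declarations.

    interface SharedObjectManagerSketch {
        long shareObject(Object obj);            // move an object into the Shared Object Space
        Object faultIn(long guid);               // request a copy of a shared object from its home node
        byte[] computeDiff(Object cachedCopy);   // compute the changes made to a cached copy
        void applyDiff(long guid, byte[] diff);  // apply received changes on the home node
        void invalidateCachedObjects();          // on acquire: invalidate all non-home shared objects
        void updateInvalidObjects();             // refresh invalid copies before they are accessed again
    }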

A.2.4 GUIDMapper

The GUIDMapper contains the GUID table as a member and provides methods for finding an object given its GUID. The method findOrCreateGuidObj() is the only way to find or create a GUID object. All GUID objects are stored in a set. This is needed because our dangling pointers are actually references to GUID objects whose last bit is set to 1; if the GUIDMapper did not hold a valid reference to a GUID object, the garbage collector could collect it, since the tagged dangling pointer is not recognized as a reference to that object.

A.2.5 LockManager

The LockManager implements the functionality for locking and unlocking shared and node-local objects. Additionally, it is responsible for the mapping between a MonitorThread and the remote thread acquiring or releasing the lock on an object. Due to a name space conflict, the implementations for waiting and notifying on objects have been moved into the VM_Thread class.

A.2.6 DistributedClassLoader

All methods for replicating classes are implemented in the DistributedClassLoader class. It contains all serialization and deserialization methods for the class definitions, such as serializeAtoms, serializeTypeRefs, serializeFields, etc. During boot time, the flag redirectedToMaster is set on the worker nodes so that all subsequent classloading attempts are forwarded to the master node.

A.2.7 DistributedScheduler

The functionality for distributing a thread to another node is implemented in the DistributedScheduler class. A native method is declared as a member to call the load average library. The logic of the distributed scheduling is also implemented in this class, i.e. if a thread is allocated and started on a remote node, the counter for this particular node is increased, and it is decreased when the thread terminates.

A.2.8 IORedirector

The methods declared in the IORedirector intercept the calls executed in the VMChannel class if the node is a worker node. In this case the corresponding redirection method constructs a MessageIO message that is sent to the master node for further processing.
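To illustrate the scheduling bookkeeping described for the DistributedScheduler in A.2.7, a minimal sketch is given below; the load-average input from the native library is omitted and all names are illustrative.

    import java.util.concurrent.atomic.AtomicIntegerArray;

    class DistributedSchedulerSketch {
        // Number of running application threads per cluster node, kept on the master node.
        private final AtomicIntegerArray threadsPerNode;

        DistributedSchedulerSketch(int totalNodes) {
            threadsPerNode = new AtomicIntegerArray(totalNodes);
        }

        // Pick the node with the fewest running application threads for a new thread.
        int chooseNodeForNewThread() {
            int best = 0;
            for (int node = 1; node < threadsPerNode.length(); node++) {
                if (threadsPerNode.get(node) < threadsPerNode.get(best)) {
                    best = node;
                }
            }
            threadsPerNode.incrementAndGet(best);    // the thread is allocated and started there
            return best;
        }

        // Called when a worker node reports that one of its application threads terminated.
        void onThreadTerminated(int node) {
            threadsPerNode.decrementAndGet(node);
        }
    }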

A.3 Original Task Assignment

Department of Computer Science, Laboratory for Software Technology
Distributed JVM - Master Thesis for Ken Lee, February 2008

Introduction
Current VMs are limited to a single machine. But due to the virtual design they could easily be extended to multiple machines. One of the key features of software VMs like the JVM or the .NET VM is that they do not have any special hardware bounds. The general idea of a distributed VM is to extend an existing VM by a global object space. Objects are moved into this space if they can be accessed by multiple threads on different machines. Threads can be scheduled on different machines based on load and locality information.

Tasks
Your task is to implement this global memory space, including semantics for a simple garbage collector, object locking and access. Subtasks would be a signal protocol to synchronize class loading and resolving on all machines.
- Inform yourself about the ideas behind JavaSplit and other approaches to distributed (J)VMs and categorize the differences to our approach.
- Get used to the code of Jikes RVM and the Memory Management Toolkit (MMTk).
- Add a distribution subsystem to the Jikes RVM:
  - Implement a shared object space (adapt MMTk, hash code, object lookup).
  - Add object synchronization so that local objects can move to a shared object space.
  - Add a mechanism to the class loader to support distributed class loading (e.g. only the first worker node will run initialization code).
  - Support distributed object locking.
  - Develop a scheduling mechanism to distribute threads/the load over participating nodes.
  - Add I/O redirection so that the master node takes care of all I/O.
  - Add a GC component that scans shared objects and makes them local if they are no longer shared between multiple nodes.
- Develop micro benchmarks to test your development.

Figure A.2: Original task assignment page 1.

- Use these micro benchmarks and other multithreaded benchmarks like SpecJBB and JavaGrande to test and rate your implementation.
- Test your kernel extensions against standard benchmarks and optimize your system accordingly for speed.
- Write your thesis.
- Prepare for the presentation in front of the group.

General instructions
Prepare an overview on the subject and a time plan after two weeks. The project should be completed by August 2008. The course of work should be discussed regularly with the assistant. The deliverables of this work are (1) the source code, (2) a report, and (3) a one page summary. Where applicable, the source code should be provided in the form of a patch relative to the modified initial code (e.g., in the case of Linux kernel modifications). The report should be phrased as a scientific essay and should be submitted as two paper copies, and in PDF format. The preferred language is English, but at the preference of the student it can be written in German. The one page summary should be written in English and provided as an XML file. A template is provided at http://www.lst.inf.ethz.ch/teaching/sada_template.xml. There should be an oral presentation at the end of the project.

Professor: Prof. T. Gross
Assistant: Mathias Payer

Figure A.3: Original task assignment page 2.

Bibliography

[1] GNU Classpath. http://www.gnu.org/software/classpath/.
[2] Java Grande Benchmark Suite. http://www2.epcc.ed.ac.uk/computing/research_activities/java_grande/index_1.html.
[3] Jikes RVM user guide. http://jikesrvm.org/user+guide.
[4] JSR 133: Java Memory Model and Thread Specification Revision. http://jcp.org/en/jsr/detail?id=133.
[5] Linux Load Average. http://www.luv.asn.au/overheads/njg_luv_2002/luvslides.html.
[6] Linux network performance. http://aschauf.landshut.org/fh/linux/udp_vs_raw/index.html.
[7] B. Alpern, C. R. Attanasio, J. J. Barton, M. G. Burke, P. Cheng, J.-D. Choi, A. Cocchi, S. J. Fink, D. Grove, M. Hind, S. F. Hummel, D. Lieber, V. Litvinov, M. F. Mergen, T. Ngo, J. R. Russell, V. Sarkar, M. J. Serrano, J. C. Shepherd, S. E. Smith, V. C. Sreedhar, H. Srinivasan, and J. Whaley. The Jalapeño virtual machine. IBM Syst. J., 39(1):211-238, 2000.
[8] B. Alpern, C. R. Attanasio, A. Cocchi, D. Lieber, S. Smith, T. Ngo, J. J. Barton, S. F. Hummel, J. C. Sheperd, and M. Mergen. Implementing Jalapeño in Java. In OOPSLA '99: Proceedings of the 14th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, pages 314-324, 1999.
[9] Y. Aridor, M. Factor, and A. Teperman. cJVM: a Single System Image of a JVM on a Cluster. In Proceedings of the International Conference on Parallel Processing, pages 4-11, 1999.
[10] Y. Aridor, M. Factor, and A. Teperman. Implementing Java on Clusters. In Euro-Par '01: Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing, pages 722-731, 2001.
[11] J. E. Baldwin. Structuring Extensions in System Infrastructure Software using Aspects. Master's thesis, University of Victoria, 2004.
[12] S. Blackburn, R. Garner, and D. Frampton. MMTk: The Memory Management Toolkit, 2006.

[13] S. M. Blackburn, P. Cheng, and K. S. McKinley. Oil and Water? High Performance Garbage Collection in Java with MMTk. In ICSE '04: Proceedings of the 26th International Conference on Software Engineering, 2004.
[14] S. Chaumette, P. Grange, B. Metrot, and P. Vigneras. Implementing a High Performance Object Transfer Mechanism over JikesRVM, 2004.
[15] M. Factor, A. Schuster, and K. Shagin. JavaSplit: A Runtime for Execution of Monolithic Java Programs on Heterogeneous Collections of Commodity Workstations. In CLUSTER, pages 110-117, 2003.
[16] M. Factor, A. Schuster, and K. Shagin. A Distributed Runtime for Java: Yesterday and Today. IPDPS, 06:159a, 2004.
[17] W. Fang. Distributed Object Sharing for Cluster-based Java Virtual Machine. PhD thesis, University of Hong Kong, 2004.
[18] W. Fang, C.-L. Wang, and F. C. M. Lau. Efficient Global Object Space Support for Distributed JVM on Cluster. In International Conference on Parallel Processing, 2002.
[19] R. J. Garner. JMTk: A Portable Memory Management Toolkit, 2003.
[20] P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy release consistency for software distributed shared memory. SIGARCH Comput. Archit. News, 20(2), 1992.
[21] T. Lindholm and F. Yellin. The Java Virtual Machine Specification, Second Edition. Addison Wesley, 1999.
[22] M. Lobosco, A. Silva, O. Loques, and C. L. de Amorim. A New Distributed JVM for Cluster Computing, 2003.
[23] M. W. MacBeth, K. A. McGuigan, and P. J. Hatcher. Executing Java Threads in Parallel in a Distributed-Memory Environment. In CASCON '98: Proceedings of the 1998 conference of the Centre for Advanced Studies on Collaborative research, page 16. IBM Press, 1998.
[24] M. J. Ming. JESSICA: Java-Enabled Single-System-Image Computing Architecture. Master's thesis, University of Hong Kong, 1999.
[25] R. Quitadamo. The Issue of Strong Mobility: an Innovative Approach based on the IBM Jikes Research Virtual Machine. PhD thesis, University of Modena and Reggio Emilia, 2008.
[26] R. Quitadamo, G. Cabri, and L. Leonardi. Mobile JikesRVM: A framework to support transparent Java thread migration. Sci. Comput. Program., 70(2-3):221-240, 2008.
[27] J. Sinnamon, P. Strazdins, and J. Zigman. The Jikes Distributed Java Virtual Machine Manual. Australian National University, September 2003.

[28] R. Veldema, R. A. F. Bhoedjang, and H. E. Bal. Distributed Shared Memory Management for Java. In Proc. Sixth Annual Conference of the Advanced School for Computing and Imaging (ASCI 2000), pages 256-264, 2000.
[29] Y. Yarom, K. Falkner, D. S. Munro, and H. Detmold. Mu-Objects - Efficient Separation of Application and Virtual Machine Object Spaces. School of Computer Science, University of Adelaide.
[30] W. Yu and A. L. Cox. Java/DSM: A Platform for Heterogeneous Computing. Concurrency - Practice and Experience, 9(11):1213-1224, 1997.
[31] M. Zenger. JavaParty - Transparent Remote Objects in Java. In ACM 1997 Workshop on Java for Science and Engineering Computation, 1997.
[32] Y. Zhou, L. Iftode, and K. Li. Performance evaluation of two home-based lazy release consistency protocols for shared virtual memory systems. In OSDI '96: Proceedings of the second USENIX symposium on Operating systems design and implementation, pages 75-88, 1996.
[33] W. Zhu, W. Fang, C.-L. Wang, and F. C. M. Lau. High-Performance Computing on Clusters: The Distributed JVM Approach.
[34] W. Zhu, W. Fang, C.-L. Wang, and F. C. M. Lau. High-Performance Computing on Clusters: The Distributed JVM Approach, 2004.
[35] W. Zhu, C.-L. Wang, and F. C. M. Lau. JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support. In IEEE Fourth International Conference on Cluster Computing, September 2002.
[36] J. N. Zigman and R. Sankaranarayana. djvm - A distributed JVM on a Cluster. Technical report, Australian National University, 2002.