Runtime Locality Optimizations of Distributed Java Applications


16th Euromicro Conference on Parallel, Distributed and Network-Based Processing

Christian Hütter, Thomas Moschny
University of Karlsruhe

Abstract

In distributed Java environments, locality of objects and threads is crucial for the performance of parallel applications. We introduce dynamic locality optimizations in the context of JavaParty, a programming and runtime environment for parallel Java applications. Until now, an optimal distribution of the individual objects of an application had to be found manually, which has several drawbacks. Based on a former static approach, we develop a dynamic methodology for automatic locality optimizations. By measuring processing and communication times of remote method calls at runtime, a placement strategy can be computed that maps each object of the distributed system to its optimal virtual machine. Objects are then migrated between the processing nodes in order to realize this placement strategy. We evaluate our approach by comparing the performance of two benchmark applications with manually distributed versions. It is shown that our approach is particularly suitable for dynamic applications where the optimal object distribution varies at runtime.

1. Introduction

Java enables developers to express concurrency and to create parallel applications by means of threads. Performance gains over a sequential solution can only be expected if the virtual machine is executed on a system with several processors. JavaParty [10] extends Java by a distributed runtime environment that consists of several Java virtual machines. The virtual machines are executed on the nodes of a cluster of workstations. Each virtual machine has its own address space, but can perform remote method invocations on other virtual machines. Thus, JavaParty allows for performance gains through parallelism in a distributed environment. Solely distributing objects and threads over virtual machines is not sufficient for achieving performance gains. Since the placement of an object determines the processor of its methods, only methods of objects that reside on different machines can actually be executed in parallel. So we have two conflicting goals: on the one hand, groups of objects with frequent and expensive communication should be placed on the same node; on the other hand, objects should be distributed over the available processors to enable parallelism. JavaParty provides a mechanism to create remote objects on specific nodes of a cluster environment. The developer is responsible for distributing the individual objects and thus for distributing the activities to the processing nodes. Such a manual approach has several disadvantages. First, the object distribution depends on the specific topology for which the program is compiled; the distribution strategy must be adapted to each target platform. Second, manually specifying the location of every single object creation is tedious. Third, the optimal placement of objects often cannot be determined statically for dynamic applications where the optimal location of objects changes at runtime. The work at hand focuses on the automatic generation of a distribution strategy for remote objects. The generation is based on runtime information of the distributed system. Thus, the programmer does not have to worry about a proper object distribution and can focus on the solution of the problem.
Even if the initial object distribution generated by JavaParty is not optimal, the locality of the application is optimized at runtime. In Chapter 2 we give a brief overview of JavaParty. Chapter 3 discusses related work in the field of distributed Java applications. In Chapter 4 we describe the design of our approach and explain some basic concepts that are necessary for further understanding.

Chapter 5 presents the implementation and discusses the problems we encountered. In Chapter 6 we evaluate the effectiveness and efficiency of our work using two benchmark applications. Finally, Chapter 7 concludes this paper.

2. JavaParty

JavaParty extends Java by a pre-processor and a runtime environment for distributed parallel programming in workstation clusters. It transparently adds remote objects to Java whose methods can be invoked from remote virtual machines. Programmers can use the keyword remote to indicate that a class should be remotely accessible. Instances of remote classes are called remote objects, regardless of which virtual machine they reside on. The runtime system offers a mechanism to migrate remote objects between machines. Java Remote Method Invocation (RMI) [14] permits the creation of classes whose instances can be accessed remotely from other JVMs. JavaParty uses RMI as its target and thus inherits some of its advantages, e.g. distributed garbage collection. It uses a special pre-processor to generate pure Java source code that is consistent with the RMI requirements. This approach hides the increased program complexity due to RMI constraints as well as the additional code for creation and access of remote objects. JavaParty code is transformed into regular Java code plus RMI hooks. The resulting RMI portions are fed into the RMI compiler to generate stubs and skeletons. Since existing code might be using the original classes, handle objects are introduced that hide the RMI classes from the user. This approach maintains the Java object semantics such that the programmer can use remote objects just like normal Java objects.
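The remote keyword is best seen in a small example. The following sketch is illustrative (the class names and the squaring method are our own); note that remote is JavaParty syntax handled by the pre-processor, not plain Java:

```java
// JavaParty source, not plain Java: the remote modifier marks the class as
// remotely accessible; the pre-processor translates it into RMI-based code.
public remote class Worker {
    public int square(int x) {
        return x * x;            // runs on the JVM where this Worker lives
    }
}

class Main {
    public static void main(String[] args) {
        Worker w = new Worker(); // placed on some node of the environment
        int r = w.square(42);    // transparently a (possibly remote) call
        System.out.println(r);
    }
}
```

From the programmer's point of view, the Worker instance is used exactly like a local object; whether the call crosses JVM boundaries depends only on where the runtime placed it.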
3. Related work

This section gives an overview of existing systems for distributed execution of Java applications. The goal of these systems is to gain increased computational power while preserving Java's parallel programming paradigm. In [3], distributed runtime systems are categorized into cluster-aware VMs, compiler-based DSM systems, and systems using standard JVMs. The first category consists of systems that use a non-standard JVM on each node to execute distributed applications. The most important examples of such systems are cJVM [2] and JESSICA2 [16]. Both approaches provide a complete single system image of a standard JVM. The advantage of using non-standard JVMs is increased efficiency due to the ability to access machine resources directly rather than through the JVM. A weakness of such systems is their lack of cross-platform compatibility. cJVM aims at virtualizing a cluster and at obtaining high performance for regular Java applications. A number of optimization techniques are used to address caching, locality of execution, and object placement. The smart proxy mechanism of cJVM can be used as a framework to implement different locality protocols. Currently, cJVM is unable to use a standard JIT compiler and does not implement a custom one. JESSICA2 applies transparent Java thread migration to multi-threaded Java applications. The migration mechanism allows distributing threads among cluster nodes at runtime. To support shared object access, a global object space has been implemented. The system includes some important features, e.g. load balancing through thread migration, an adaptive home-migration protocol, and a custom JIT compiler. Other systems compile the source or class files of a Java application into native machine code. Both Hyperion [1] and Jackal [15] support standard Java and do not change its programming paradigm. The usage of a custom source or byte code compiler has the disadvantage that such a compiler must continually be adapted to changes of the Java language specification. The advantage of compiler-based systems is their increased performance because of compiler optimizations and direct access to system resources. Hyperion offers an infrastructure for heterogeneous clusters providing the illusion of a single JVM. The original Java threads are mapped onto native system threads which are spread across the processing nodes to provide load balancing. The Java memory model is implemented by a DSM protocol, so the original semantics of the Java language is kept unchanged. To achieve portability, the Hyperion platform has been built on top of a portable runtime environment which supports various networks and communication interfaces. Jackal is a DSM system for Java which consists of an optimizing compiler and a runtime system. In combination with compiler optimizations, Jackal applies various runtime optimizations to increase locality and manage large data structures. The runtime system includes a distributed garbage collector and provides thread and object location transparency. While most systems use standard JVMs, only a few of them preserve the standard Java programming paradigm. Examples of such systems are JavaSymphony [4] and ADAJ [5].

Using standard JVMs has the advantage that such systems can use heterogeneous nodes which locally optimize their performance using a JIT compiler. The main disadvantage of such systems is their relatively slow access to system resources. JavaSymphony is a programming environment for distributed and parallel computing that exploits heterogeneous resources. In order to use JavaSymphony efficiently, the programmer has to explicitly control data locality and load balancing. The structure of the computing resources has to be defined manually. Since all objects must be created, mapped, and freed explicitly, the handling of remote objects can be quite cumbersome. JavaSymphony does not offer assistance for those manual steps, so the semi-automatic distribution is likely to be error-prone. ADAJ is an environment for the development and execution of distributed Java applications. ADAJ is designed on top of JavaParty and is therefore most closely related to our work. The ADAJ project deals with placement and migration of Java objects. It automatically deploys parallel Java applications on a cluster of workstations by monitoring the application behavior. ADAJ contains a load-balancing mechanism that considers changes in the evolution of the application. While the focus of ADAJ is to balance the load between the individual JVMs, we concentrate on optimizing the locality of the distributed application.

4. Design

4.1. Locality optimizations

Philippsen and Haumacher proposed locality optimizations in JavaParty by means of static type analysis [11]. They classify approaches to deal with locality in parallel object-oriented languages into three categories: (i) let the programmer specify placement and migration explicitly by means of annotations, (ii) static object distribution, where the compiler tries to predict the best node for a new object, and (iii) dynamic object distribution, based on a runtime system that keeps track of the call graph. JavaParty already provides mechanisms for manual object placement and migration, so we focus on static and dynamic object distribution in the following.

4.1.1. Static object distribution

Although a Java thread cannot migrate, the control flow (called activity in the following) can: when a method of a remote object is invoked, the activity conceptually leaves the JVM of the caller and is continued at the callee's JVM, where it competes with other activities. Due to time-slicing and blocking, competing activities on one JVM decrease the total parallelism. Additional costs are introduced by the remote method invocation itself because of communication latency and bandwidth limitations. Thus, the general distribution strategy must be activity-centered: different activities should be placed onto different JVMs. Objects should be co-located with activities such that method invocation is local. Local method invocation avoids network communication and competing activities. Haumacher proposes an iterative procedure [6] to assign objects to activities and then activities to virtual machines. Based on a static type analysis, estimates for two values are derived: work(t, a) describes the computing time that activity t spends on methods of object a, and cost(t, a) describes the communication time that would be necessary if t and a were not located in the same address space. Through the placement of object a, the computing time gained by the activity t in whose address space a is created should be maximized.
At the same time, the sum of the communication costs incurred by those activities t_i assigned to remote virtual machines should be minimized. We assume an initial setting where all objects are located in a single address space with a single processor, such that all method calls are local. In order to distribute objects to activities, we suppose that each activity is running in a different address space with its own processor. By placing object a in the address space of activity t, method calls of a by t can be executed in parallel with other activities. Thus, work(t, a) indicates the time that is gained by the placement of a within the address space of t. This gain breaks even with the communication cost that other activities t_i spend to access methods of a exactly when work(t, a) equals the sum of the cost(t_i, a). So each object a can be mapped to the activity t in whose address space it should be placed:

    activity(a) = argmax over t of ( work(t, a) − Σ_{t_i ≠ t} cost(t_i, a) )

Since usually more activities are used than virtual machines are available, several activities must share a virtual machine. Thus, it is necessary to identify groups of activities that should be executed on a shared virtual machine. The parallelization win of each activity can be estimated by mapping each object to its optimal activity. The parallelization win is the sum of work(t, a) over all objects a which reside in the address space of activity t, minus the sum of cost(t, b) over all objects b that are placed remotely:

    win(t) = Σ_{a : activity(a) = t} work(t, a) − Σ_{b : activity(b) ≠ t} cost(t, b)

The sum of work(t, a) represents the computing time that activity t spends in its own address space. This work is done in parallel with other activities if no synchronization mechanisms are used. The time that is spent on communication with other address spaces is represented by the sum of cost(t, b) over all objects b that are not assigned to activity t. Note that we charge the cost of a remote call to the activity that invoked the remote method, not to the activity that actually executes the method call. Activities are assigned to the available virtual machines in decreasing order of their parallelization wins until a single activity has been scheduled to each virtual machine. For each remaining activity, a new parallelization win is computed that accounts for the potential co-location with other activities. The activity is assigned to the group of activities with the highest combined parallelization win. This process is repeated until all activities are scheduled to their optimal virtual machine. The result of the distribution analysis is a mapping of each remote object to the virtual machine on which it should be placed.
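To make the two formulas concrete, the following sketch computes the object-to-activity mapping and the parallelization win from dense work and cost matrices. It is a minimal re-implementation of the formulas above under assumed matrix inputs, not the actual JavaParty code:

```java
// Sketch: objects are assigned to activities by the argmax rule, and the
// parallelization win of each activity is derived from that assignment.
class DistributionSketch {
    // activity(a) = argmax over t of ( work(t,a) - sum_{ti != t} cost(ti,a) )
    static int[] assignObjects(double[][] work, double[][] cost) {
        int activities = work.length, objects = work[0].length;
        int[] activityOf = new int[objects];
        for (int a = 0; a < objects; a++) {
            double best = Double.NEGATIVE_INFINITY;
            for (int t = 0; t < activities; t++) {
                double gain = work[t][a];
                for (int ti = 0; ti < activities; ti++)
                    if (ti != t) gain -= cost[ti][a];
                if (gain > best) { best = gain; activityOf[a] = t; }
            }
        }
        return activityOf;
    }

    // win(t) = sum of work(t,a) over owned objects a
    //        - sum of cost(t,b) over remotely placed objects b
    static double win(int t, int[] activityOf, double[][] work, double[][] cost) {
        double win = 0.0;
        for (int a = 0; a < activityOf.length; a++)
            win += (activityOf[a] == t) ? work[t][a] : -cost[t][a];
        return win;
    }
}
```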

4.1.2. Dynamic object distribution

While Philippsen and Haumacher focus on static object distribution through type analysis, we rely on dynamic object distribution to improve locality. This approach is reported to have two disadvantages: first, there is no knowledge about future call graphs or invocation frequencies; second, the creation of objects that cannot migrate often results in a broad redistribution of other objects. The first problem is inherent to dynamic approaches, but can be softened by using heuristics to predict future behavior. The second problem is less of an issue in homogeneous cluster environments and can be handled by avoiding cyclic redistributions of remote objects. Besides these problems, the dynamic approach has the essential advantage that instead of estimating the values of work and cost, they can be measured: we take work as the actual execution time of a method call and cost as the communication time of a remote method invocation. As detailed later, we have to estimate the cost of remote calls that are actually executed locally because the called object resides on the same node. We adapt Haumacher's approach and use an iterative procedure to distribute objects to activities and then assign activities to virtual machines. Objects are migrated to the virtual machine their optimal activity is assigned to.

4.2. Time measurements

Having developed a placement methodology for remote objects, we now focus on how to measure the time values required for the distribution algorithm. Beginning with the Pentium processor, Intel allows the programmer to access a time-stamp counter [8]. This counter keeps an accurate count of every cycle that occurs on the processor: it starts at zero and is incremented every clock cycle. To access the counter, programmers can use the RDTSC (read time-stamp counter) instruction. We use the counter to get a time estimate for the duration of method invocations. Note that the time-stamp counter measures cycles, not time. Thus, comparing cycle counts only makes sense on processors of the same speed, as in a homogeneous cluster environment. To compare processors of different speeds, the cycle counts would have to be converted into time units. While the unit of time returned by System.currentTimeMillis() is a millisecond, the granularity of the value depends on the underlying OS and may be larger. Thus, the time-stamp counter also allows much finer measurements.
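A JNI wrapper of the kind our measurement framework uses (see Section 5.1) might look as follows; only the class and method names follow Table 1, while the library name and the native implementation shown in the comment are assumptions. A duration is then obtained as the difference of two counter reads.

```java
// Sketch of a JNI wrapper for RDTSC; the native library name is hypothetical.
public final class RDTSC {
    static {
        System.loadLibrary("rdtsc");    // loads the native part, e.g. librdtsc.so
    }

    // Returns the raw cycle count of the processor's time-stamp counter.
    public static native long readccounter();

    private RDTSC() {}                  // static utility, no instances
}

/* A matching C implementation for x86 (GCC inline assembly) could be:
 *
 *   JNIEXPORT jlong JNICALL Java_RDTSC_readccounter(JNIEnv *env, jclass cls) {
 *       unsigned int lo, hi;
 *       __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
 *       return ((jlong) hi << 32) | lo;
 *   }
 */
```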
To avoid measurement errors due to concurrency, we assume that the workstations of the cluster are used exclusively for JavaParty. In the presence of background jobs, cycle counting does not always reflect the real execution time of an application. But in the long run, the interrupts caused by background jobs are approximately the same for all workstations of a homogeneous cluster. Thus, we assume that those interrupts balance out over time such that cycle counting actually reflects the average execution time.

4.3. Remote Method Invocation

RMI uses a standard mechanism for communicating with remote objects: stubs and skeletons. A stub for a remote object acts as a local representative or proxy for the remote object. The stub hides the serialization of parameters and the network communication, whereas the skeleton is responsible for dispatching the call to the actual remote object implementation. We want to measure work(t, a) and cost(t, a) in order to apply the distribution algorithm. In the context of stubs and skeletons, work corresponds to the time that the actual method implementation takes, and cost corresponds to the time that is required for carrying out the remote call, i.e. marshaling and transmitting parameters and result. For a remote object r, a stub is instantiated on each node, while only one skeleton is instantiated on the node where the implementation of r resides. That is, there are n stubs and one skeleton for each remote object. Basically, our approach is to measure the communication time of a remote call in the stub and the execution time of the implementation in the skeleton by using the RDTSC instruction. We store aggregated work and cost values in the skeleton.
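Conceptually, this instrumentation amounts to the following sketch; the stub and skeleton classes, the direct dispatch path, and the bookkeeping are simplified, single-threaded stand-ins for the generated code described in Section 5.2, reusing the RDTSC wrapper from above:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, single-threaded sketch of the instrumented call path.
class SkeletonSketch {
    final Map<Long, long[]> times = new HashMap<>(); // threadId -> {work, cost}
    private long lastWork;                           // work of the latest call

    int dispatch(int arg, long threadId) {
        long start = RDTSC.readccounter();
        int result = arg * arg;                      // the actual implementation
        lastWork = RDTSC.readccounter() - start;     // measured work
        times.computeIfAbsent(threadId, k -> new long[2])[0] += lastWork;
        return result;
    }

    // Called by the stub once the remote call has returned.
    void reportTotal(long threadId, long totalCycles) {
        long cost = totalCycles - lastWork;          // communication overhead
        times.computeIfAbsent(threadId, k -> new long[2])[1] += cost;
    }
}

class StubSketch {
    int compute(int arg, SkeletonSketch target, long threadId) {
        long start = RDTSC.readccounter();
        int result = target.dispatch(arg, threadId); // marshal, transmit, execute
        target.reportTotal(threadId, RDTSC.readccounter() - start);
        return result;
    }
}
```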

5. Implementation

5.1. Time measurements

Our framework for performance measuring wraps the RDTSC instruction described in the previous chapter using the Java Native Interface [13]. As detailed in Table 1, accessing the system time is orders of magnitude more expensive than using the RDTSC instruction. Times were measured on a Pentium III 800 MHz system.

Table 1. Cost of RDTSC.readccounter() versus System.currentTimeMillis(), in cycles and µs.

5.2. KaRMI

KaRMI [12] is a fast replacement for Java RMI. It is based on an efficient object serialization mechanism that replaces regular Java serialization. Since the remote method invocation protocol is different from Java RMI, the format of stubs and skeletons is different, too. The KaRMI compiler generates stub and skeleton classes from compiled remote classes. We modified the generation of stubs and skeletons to include code that measures the execution times of remote calls. The measured times are processed by the distribution task to compute an optimal object distribution. More precisely, we modified the generation of stubs to measure the total execution time of remote calls. Once a remote call returns, the stub sends the total time to the skeleton, which measured the execution time of the actual implementation (i.e. work). Using both values, we compute cost as the difference between the total time and work. In order to transmit the total time from stub to skeleton, we added methods to send and receive the measured times to the client and server side of the connection. These methods are called after a remote method invocation has been completed and the result is marshaled back to the caller. Finally, the work and cost values are stored in the skeleton using a special data structure described later.

5.3. Estimation of cost

An important optimization carried out by JavaParty is that a call is only executed remotely if the called object actually resides on another node. Otherwise, the call is executed locally. Recall that cost(t, a) estimates the communication time that would be necessary if activity t were not located on the same node as object a. While we are able to measure the actual communication time of remote calls, we have to estimate the cost of local calls as if they were remote. Thus, we have to develop a model that estimates the communication cost based on the measured cost of a local call. Whenever the client and server objects are in the same address space, arguments and result are cloned to preserve the copy semantics of a remote call. JavaParty produces a deep clone, with all referenced objects also being cloned. In the generated stubs, the instrumented version of the local short cut measures the cost of cloning arguments and return value. The measurement can be divided into three parts: cloning of the arguments, local method invocation, and cloning of the result. Based on the measured local cost of cloning arguments and result, we estimate the communication cost the call would have incurred had it been remote. For this purpose we analyzed the results of a benchmark suite that measures the execution times of local and remote method calls for a representative set of parameter types. Given the duration of a local call, we estimate how long a remote call would take. While the absolute values are likely to vary on different machines, the relation between local and remote calls should be approximately the same. For simplicity, we assume a linear model with offset a and gradient b:

    remote cost = a + b · (local cost)

We applied a nonlinear least-squares algorithm to the results of the benchmark suite in order to fit the estimate function and determine the values of a and b.
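A sketch of such a model follows, with an ordinary (linear) least-squares fit standing in for the nonlinear fitting procedure we actually used:

```java
// Linear cost model: remote cost = a + b * (local cost).
class CostModel {
    private final double a;   // offset
    private final double b;   // gradient

    CostModel(double a, double b) { this.a = a; this.b = b; }

    // Estimate what a locally executed (short-cut) call would have cost
    // if it had actually gone over the network.
    double estimateRemoteCost(double localCost) {
        return a + b * localCost;
    }

    // Ordinary-least-squares fit over benchmark pairs (local[i], remote[i]).
    static CostModel fit(double[] local, double[] remote) {
        int n = local.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += local[i];
            sy += remote[i];
            sxx += local[i] * local[i];
            sxy += local[i] * remote[i];
        }
        double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double a = (sy - b * sx) / n;
        return new CostModel(a, b);
    }
}
```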
5.4. Smoothing and storing time values

We use a hash map to store time values, mapping activities to work and cost values. JavaParty assigns a globally unique thread id to activities that are involved in remote calls. If a new measurement is to be stored, the given thread id is mapped to a pair of work and cost values. We store these values directly with the skeleton, so the addressed object is implicit. Since work and cost indicate the computing and communication times an activity spends on all methods of an object, we have to aggregate the values of the individual methods in a reasonable way. We use an exponential moving average, which has the following advantages over simply adding up the time values: first, the weighting for each data point decreases exponentially, giving more importance to recent observations while still not discarding older observations; second, the weighting makes our measurement more robust against outliers, e.g. delayed execution because of distributed garbage collection; third, the exponential moving average is easy to compute and thus a relatively cheap operation.
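A minimal sketch of such an average; the smoothing factor alpha is an assumption and not a value fixed by our implementation. A pair of such averages per thread id takes the place of plain sums in the hash map described above.

```java
// Exponential moving average: ema = alpha * sample + (1 - alpha) * ema.
class MovingAverage {
    private final double alpha;     // smoothing factor in (0, 1]
    private double value;
    private boolean initialized;

    MovingAverage(double alpha) { this.alpha = alpha; }

    void add(double sample) {
        // The first sample seeds the average; later samples are blended in
        // with exponentially decreasing weight for older observations.
        value = initialized ? alpha * sample + (1.0 - alpha) * value : sample;
        initialized = true;
    }

    double get() { return value; }
}
```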

5.5. Application monitoring

JavaParty offers an interface that allows plugging in additional classes that can be used for monitoring the distributed environment. In our case, the monitor interface is implemented as an invisible task that collects runtime data based on instrumentation. This data is used to analyze the distribution of remote objects over the virtual machines. In JavaParty, references to remote objects are stored in a distributed fashion. Thus, we have to iterate over all virtual machines to obtain references to the remote objects. These references are used to collect the measured times. The monitor also serves as front end for the distribution task, which can either be scheduled for repeated fixed-delay execution or invoked manually via a library call, as sketched below. Basically, our distribution task fetches the measured times and runs the distribution algorithm discussed in Section 4.1. The distribution algorithm sorts the application threads according to their parallelization wins. Each activity is assigned to a group of activities which are optimally placed on the same virtual machine. Finally, each object is assigned to its optimal JVM and possibly migrated there. The migration succeeds only for objects that are not declared to be resident. If nothing was changed during the migration, the distribution task is canceled.
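A sketch of how such a task can be scheduled with java.util.Timer; the steps in the run body are placeholders for the procedure described above, not the actual monitor code:

```java
import java.util.Timer;
import java.util.TimerTask;

// Sketch: the distribution task runs with fixed delay or on demand.
class MonitorSketch {
    private final Timer timer = new Timer("distribution-task", true); // daemon

    void scheduleFixedDelay(long delayMs) {
        timer.schedule(new TimerTask() {
            @Override public void run() { runDistribution(); }
        }, delayMs, delayMs);   // repeated fixed-delay execution
    }

    // Also callable directly, corresponding to the manual library call.
    void runDistribution() {
        // 1. iterate over all JVMs and poll remote objects for measured times
        // 2. run the distribution algorithm of Section 4.1
        // 3. migrate non-resident objects to their optimal JVM
    }
}
```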
6. Evaluation

In order to evaluate the effectiveness and efficiency of our work, we examined two applications that have potential for locality optimizations. If a program were already distributed optimally at compile time and its locality did not change during runtime, there would be nothing to optimize. The first application is a numerical algorithm that has a static structure; we started with a sub-optimal distribution and optimized its locality during runtime. The second application is an n-body simulation with an inherently dynamic structure; we started with an optimal distribution and adapted the locality as the structure of the application changed. All measurements in this chapter have been conducted on our Carla cluster, using the Java Server VM 1.4.2_13-b06. This cluster consists of 16 nodes equipped with two Pentium III 800 MHz processors and 1 GB RAM each.

6.1. Successive over-relaxation

Successive over-relaxation (SOR) is a numerical algorithm for solving Laplace equations on a grid. The sequential implementation involves an outer loop for the iterations and two inner loops, each looping over the grid. During an iteration, the new value of each point of the grid is determined by calculating the average value of the four neighbor points. The algorithm terminates if no point of the grid has changed by more than a certain threshold. The parallel implementation [9] provided by Maassen is based on a red-black ordering mechanism. The grid is partitioned among the available processors, each processor receiving a number of adjacent rows. Before a processor starts to update the points of a certain color, it exchanges the border rows of the opposite color with its neighbors; a sketch of the red-black update follows below.

Figure 1. Results of the SOR benchmark (time [ms] per iteration over the number of machines; 1000x1000 grid, 300 iterations; manual, optimized, and random versions).

The SOR benchmark performs 300 iterations of successive over-relaxation on a 1000x1000 grid of double values. The performance was measured on 2, 4, 8, and 16 nodes and is reported in milliseconds per iteration. In order to evaluate our approach, we created three versions of the benchmark: (i) a manual version that creates all remote objects at their optimal location, (ii) a random version where the location of the remote objects is determined randomly, and (iii) an optimized version which invokes the locality optimizations after the first iteration, based on the random object distribution.
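For reference, a sequential sketch of one red-black SOR iteration; this is our own minimal rendering of the update rule, not Maassen's parallel code, which additionally partitions rows among processors and exchanges border rows:

```java
// Sequential sketch of red-black SOR on a grid of doubles.
class SOR {
    // Update only the points of one color ((i + j) % 2 == color).
    static void sweep(double[][] g, double omega, int color) {
        for (int i = 1; i < g.length - 1; i++) {
            for (int j = 1; j < g[i].length - 1; j++) {
                if ((i + j) % 2 != color) continue;
                double avg = (g[i - 1][j] + g[i + 1][j]
                            + g[i][j - 1] + g[i][j + 1]) / 4.0;
                g[i][j] += omega * (avg - g[i][j]);   // over-relaxed update
            }
        }
    }

    // One full iteration: first the red points, then the black points.
    static void iterate(double[][] g, double omega) {
        sweep(g, omega, 0);
        sweep(g, omega, 1);
    }
}
```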

The results of the SOR benchmark are shown in Figure 1. As expected, the manual version performs best, with a constant speedup as the number of machines increases. The random version performs worst and does not scale with additional machines. Finally, the optimized version of the benchmark performs considerably better than the random version, improving its performance towards the optimal version. If more iterations were performed, the optimized version would do even better, since the cost of the locality optimizations would carry less weight. Figure 1 might give the impression that the optimized version does not scale with additional machines. This is not exactly true, since the cost of the locality optimizations is proportional to the number of nodes, too. Table 2 details the cost of the procedure for the SOR benchmark. Polling the remote objects clearly dominates the overall cost. In spite of its quadratic complexity, the cost of the distribution algorithm is relatively small. Again, if the number of iterations were increased or a benchmark with longer processing times were used, the cost would decrease.

Table 2. Cost of the locality optimizations in ms: polling remote objects, computing the locality algorithm, cost of migrating objects, and overall cost.

6.3. N-body simulation

The n-body simulation approximates the movement of n particles in a two-dimensional space based on mutual gravitation. The simulation is discretized into time steps where the gravity between each pair of the n particles must be computed for each time step. Afterwards, acceleration and change in velocity and location are determined for each particle. In order to avoid the quadratic complexity of computing the forces, the present implementation uses an approximation proposed by Barnes and Hut. Through hierarchical grouping and generation of substitute masses for distant space regions, the computation complexity is reduced to O(n log n) operations per time step. We refer to [7] for a detailed description of the benchmark.

The benchmark performs 10 iterations of the n-body simulation with 1000 particles. The performance was measured on 2, 4, 8, and 16 nodes and is reported in seconds per iteration. Again, we created three versions of the benchmark: (i) a manual version with explicit placement annotations, (ii) a random version where the location of the remote objects is determined randomly, and (iii) an optimized version which invokes the locality optimizations after the first iteration, based on the random object distribution.

Figure 2 shows the results of the n-body benchmark. Because of the dynamic structure of the benchmark, an optimal distribution of the remote objects is hard to predict and depends on the spatial distribution of the particles. As the initial coordinates of the particles are determined randomly and thus are not known a priori, the manual version of the benchmark performs only slightly better than the random version. Since the locality of the application is adapted to the actual location of the particles, the optimized version of the benchmark performs best. The cost of the locality optimizations can easily be covered by the savings achieved during the following iterations.

Figure 2. Results of the n-body benchmark (time [s] per iteration over the number of machines; 1000 particles, 10 iterations; manual, optimized, and random versions).

The n-body benchmark is a good example of the effectiveness of our approach. In dynamic settings such as the n-body simulation, it is hard and sometimes impossible to determine a good initial distribution of the remote objects. Even if an optimal distribution can be determined, the performance of the initial distribution will decrease as the locality of the application changes. Only a dynamic approach that optimizes the locality at runtime can guarantee consistently high performance throughout the whole life cycle of the application.

7. Conclusion and future work

In this work, we presented runtime locality optimizations of distributed Java applications. Based on a static approach, we developed a dynamic methodology to automatically generate a distribution strategy for the objects of a distributed system. We instrumented stubs and skeletons to measure the execution time and communication cost of remote calls. The measured time values are stored locally to avoid communication overhead. The locality optimizations are implemented as a task that runs periodically or can be started on demand. This task collects the measured time values and computes an optimal distribution strategy. In order to realize the distribution strategy, objects are migrated between machines. We evaluated the effectiveness and efficiency of our work by optimizing two benchmark applications. The first benchmark is a typical example of a numerical algorithm with a static structure, so we created a random initial distribution of the objects and optimized their locality at runtime. The second benchmark has a dynamic structure, so that the performance of the initial object distribution, even of an optimal one, will deteriorate at runtime. We have shown that our approach is particularly suitable for such dynamic settings. In future work, we will focus on automatically adapting the period of the distribution task such that it reflects the processing time of the application. If the structure of the application does not change, we might even want to switch off the measuring completely. For large clusters with thousands of processors, or for applications with a great number of objects, an algorithm with quadratic complexity might be suboptimal. We could imagine a distributed algorithm that works with exact time values for only a couple of local nodes and extrapolates the values for remote nodes.

References

[1] G. Antoniu, L. Bouge, P. Hatcher, M. MacBeth, K. McGuigan, and R. Namyst, The Hyperion system: Compiling multithreaded Java bytecode for distributed execution, Parallel Computing.
[2] Y. Aridor, M. Factor, and A. Teperman, cJVM: a single system image of a JVM on a cluster, Parallel Processing, 1999.
[3] M. Factor, A. Schuster, and K. Shagin, A distributed runtime for Java: yesterday and today, Parallel and Distributed Processing Symposium.
[4] T. Fahringer, JavaSymphony: a system for development of locality-oriented distributed and parallel Java applications, Cluster Computing.
[5] V. Felea, R. Olejnik, and B. Toursel, ADAJ: a Java Distributed Environment for Easy Programming Design and Efficient Execution, Schedae Informaticae, UJ Press, Krakow, 2004.
[6] B. Haumacher, Lokalitätsoptimierung durch statische Typanalyse in JavaParty (Locality optimization through static type analysis in JavaParty), Diploma thesis, Institute for Program Structures and Data Organization, University of Karlsruhe.
[7] B. Haumacher, Plattformunabhängige Umgebung für verteilt paralleles Rechnen mit Rechnerbündeln (Platform-independent environment for distributed parallel computing on workstation clusters), PhD thesis, Institute for Program Structures and Data Organization, University of Karlsruhe.
[8] Intel Corp., Using the RDTSC Instruction for Performance Monitoring, SCPM1.HTM.
[9] J. Maassen and R. V. van Nieuwpoort, Fast parallel Java, Master's thesis, Dept. of Computer Science, Vrije Universiteit, Amsterdam.
[10] M. Philippsen and M. Zenger, JavaParty - Transparent Remote Objects in Java, Concurrency: Practice and Experience.
[11] M. Philippsen and B. Haumacher, Locality optimization in JavaParty by means of static type analysis, Proc. Workshop on Java for High Performance Network Computing at EuroPar '98, Southampton.
[12] M. Philippsen, B. Haumacher, and C. Nester, More Efficient Serialization and RMI for Java, Concurrency: Practice and Experience, John Wiley & Sons, Chichester, West Sussex, May 2000.
[13] Sun Microsystems, Java Native Interface.
[14] Sun Microsystems, Java Remote Method Invocation Specification, OC.html.
[15] R. Veldema, R. A. F. Bhoedjang, and H. E. Bal, Jackal, a compiler based implementation of Java for clusters of workstations, Proceedings of PPoPP.
[16] W. Zhu, C.-L. Wang, and F. C. M. Lau, JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support, IEEE Fourth International Conference on Cluster Computing, Chicago, USA.


More information

Fachbereich Informatik und Elektrotechnik SunSPOT. Ubiquitous Computing. Ubiquitous Computing, Helmut Dispert

Fachbereich Informatik und Elektrotechnik SunSPOT. Ubiquitous Computing. Ubiquitous Computing, Helmut Dispert Ubiquitous Computing Ubiquitous Computing The Sensor Network System Sun SPOT: The Sun Small Programmable Object Technology Technology-Based Wireless Sensor Networks a Java Platform for Developing Applications

More information

Report of the case study in Sistemi Distribuiti A simple Java RMI application

Report of the case study in Sistemi Distribuiti A simple Java RMI application Report of the case study in Sistemi Distribuiti A simple Java RMI application Academic year 2012/13 Vessio Gennaro Marzulli Giovanni Abstract In the ambit of distributed systems a key-role is played by

More information

Language Based Virtual Machines... or why speed matters. by Lars Bak, Google Inc

Language Based Virtual Machines... or why speed matters. by Lars Bak, Google Inc Language Based Virtual Machines... or why speed matters by Lars Bak, Google Inc Agenda Motivation for virtual machines HotSpot V8 Dart What I ve learned Background 25+ years optimizing implementations

More information

On Performance of Delegation in Java

On Performance of Delegation in Java On Performance of Delegation in Java Sebastian Götz Software Technology Group, Dresden University of Technology, Germany sebastian.goetz@mail.inf.tu-dresden.de Mario Pukall Database Research Group, Otto-von-Guericke-University

More information

Jini. Kurzfassung als Kapitel für die Vorlesung Verteilte Systeme. (unter Nutzung von Teilen von Andreas Zeidler und Roger Kehr)

Jini. Kurzfassung als Kapitel für die Vorlesung Verteilte Systeme. (unter Nutzung von Teilen von Andreas Zeidler und Roger Kehr) Jini Kurzfassung als Kapitel für die Vorlesung Verteilte Systeme Friedemann Mattern (unter Nutzung von Teilen von Andreas Zeidler und Roger Kehr) Jini Infrastructure ( middleware ) for dynamic, cooperative,

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters Interpreters and virtual machines Michel Schinz 2007 03 23 Interpreters Interpreters Why interpreters? An interpreter is a program that executes another program, represented as some kind of data-structure.

More information

Distributed Dynamic Load Balancing for Iterative-Stencil Applications

Distributed Dynamic Load Balancing for Iterative-Stencil Applications Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,

More information

Java Real-Time Distributed Processing over Chorus/OS

Java Real-Time Distributed Processing over Chorus/OS Java Real-Time Distributed Processing over Chorus/OS Christophe Lizzi CS Technologies Informatiques lizzi@csti.fr CNAM, CEDRIC lab. lizzi@cnam.fr Outline Motivations Our vision Real-time Java Operating

More information

Performance Comparison of Server Load Distribution with FTP and HTTP

Performance Comparison of Server Load Distribution with FTP and HTTP Performance Comparison of Server Load Distribution with FTP and HTTP Yogesh Chauhan Assistant Professor HCTM Technical Campus, Kaithal Shilpa Chauhan Research Scholar University Institute of Engg & Tech,

More information

Multi-GPU Load Balancing for Simulation and Rendering

Multi-GPU Load Balancing for Simulation and Rendering Multi- Load Balancing for Simulation and Rendering Yong Cao Computer Science Department, Virginia Tech, USA In-situ ualization and ual Analytics Instant visualization and interaction of computing tasks

More information

Compiling Object Oriented Languages. What is an Object-Oriented Programming Language? Implementation: Dynamic Binding

Compiling Object Oriented Languages. What is an Object-Oriented Programming Language? Implementation: Dynamic Binding Compiling Object Oriented Languages What is an Object-Oriented Programming Language? Last time Dynamic compilation Today Introduction to compiling object oriented languages What are the issues? Objects

More information

Chapter 2: OS Overview

Chapter 2: OS Overview Chapter 2: OS Overview CmSc 335 Operating Systems 1. Operating system objectives and functions Operating systems control and support the usage of computer systems. a. usage users of a computer system:

More information

Comparison of Request Admission Based Performance Isolation Approaches in Multi-tenant SaaS Applications

Comparison of Request Admission Based Performance Isolation Approaches in Multi-tenant SaaS Applications Comparison of Request Admission Based Performance Isolation Approaches in Multi-tenant SaaS Applications Rouven Kreb 1 and Manuel Loesch 2 1 SAP AG, Walldorf, Germany 2 FZI Research Center for Information

More information

PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)

PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology) PARALLEL JAVASCRIPT Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology) JAVASCRIPT Not connected with Java Scheme and self (dressed in c clothing) Lots of design errors (like automatic semicolon

More information

Preserving Message Integrity in Dynamic Process Migration

Preserving Message Integrity in Dynamic Process Migration Preserving Message Integrity in Dynamic Process Migration E. Heymann, F. Tinetti, E. Luque Universidad Autónoma de Barcelona Departamento de Informática 8193 - Bellaterra, Barcelona, Spain e-mail: e.heymann@cc.uab.es

More information

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach S. M. Ashraful Kadir 1 and Tazrian Khan 2 1 Scientific Computing, Royal Institute of Technology (KTH), Stockholm, Sweden smakadir@csc.kth.se,

More information

Hardware/Software Co-Design of a Java Virtual Machine

Hardware/Software Co-Design of a Java Virtual Machine Hardware/Software Co-Design of a Java Virtual Machine Kenneth B. Kent University of Victoria Dept. of Computer Science Victoria, British Columbia, Canada ken@csc.uvic.ca Micaela Serra University of Victoria

More information

CS 575 Parallel Processing

CS 575 Parallel Processing CS 575 Parallel Processing Lecture one: Introduction Wim Bohm Colorado State University Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5

More information

gprof: a Call Graph Execution Profiler 1

gprof: a Call Graph Execution Profiler 1 gprof A Call Graph Execution Profiler PSD:18-1 gprof: a Call Graph Execution Profiler 1 by Susan L. Graham Peter B. Kessler Marshall K. McKusick Computer Science Division Electrical Engineering and Computer

More information

JBoss Data Grid Performance Study Comparing Java HotSpot to Azul Zing

JBoss Data Grid Performance Study Comparing Java HotSpot to Azul Zing JBoss Data Grid Performance Study Comparing Java HotSpot to Azul Zing January 2014 Legal Notices JBoss, Red Hat and their respective logos are trademarks or registered trademarks of Red Hat, Inc. Azul

More information

Building Scalable Applications Using Microsoft Technologies

Building Scalable Applications Using Microsoft Technologies Building Scalable Applications Using Microsoft Technologies Padma Krishnan Senior Manager Introduction CIOs lay great emphasis on application scalability and performance and rightly so. As business grows,

More information

Experimental Evaluation of Distributed Middleware with a Virtualized Java Environment

Experimental Evaluation of Distributed Middleware with a Virtualized Java Environment Experimental Evaluation of Distributed Middleware with a Virtualized Java Environment Nuno A. Carvalho, João Bordalo, Filipe Campos and José Pereira HASLab / INESC TEC Universidade do Minho MW4SOC 11 December

More information

Introduction CORBA Distributed COM. Sections 9.1 & 9.2. Corba & DCOM. John P. Daigle. Department of Computer Science Georgia State University

Introduction CORBA Distributed COM. Sections 9.1 & 9.2. Corba & DCOM. John P. Daigle. Department of Computer Science Georgia State University Sections 9.1 & 9.2 Corba & DCOM John P. Daigle Department of Computer Science Georgia State University 05.16.06 Outline 1 Introduction 2 CORBA Overview Communication Processes Naming Other Design Concerns

More information

Technical Research Paper. Performance tests with the Microsoft Internet Security and Acceleration (ISA) Server

Technical Research Paper. Performance tests with the Microsoft Internet Security and Acceleration (ISA) Server Technical Research Paper Performance tests with the Microsoft Internet Security and Acceleration (ISA) Server Author: Martin Eisermann Date: 2002-05-13 City: Bad Aibling, Germany Annotations: This research

More information

Various Schemes of Load Balancing in Distributed Systems- A Review

Various Schemes of Load Balancing in Distributed Systems- A Review 741 Various Schemes of Load Balancing in Distributed Systems- A Review Monika Kushwaha Pranveer Singh Institute of Technology Kanpur, U.P. (208020) U.P.T.U., Lucknow Saurabh Gupta Pranveer Singh Institute

More information

Web Services. Copyright 2011 Srdjan Komazec

Web Services. Copyright 2011 Srdjan Komazec Web Services Middleware Copyright 2011 Srdjan Komazec 1 Where are we? # Title 1 Distributed Information Systems 2 Middleware 3 Web Technologies 4 Web Services 5 Basic Web Service Technologies 6 Web 2.0

More information

Performance Measurement of Dynamically Compiled Java Executions

Performance Measurement of Dynamically Compiled Java Executions Performance Measurement of Dynamically Compiled Java Executions Tia Newhall and Barton P. Miller University of Wisconsin Madison Madison, WI 53706-1685 USA +1 (608) 262-1204 {newhall,bart}@cs.wisc.edu

More information

Real-Time Monitoring Framework for Parallel Processes

Real-Time Monitoring Framework for Parallel Processes International Journal of scientific research and management (IJSRM) Volume 3 Issue 6 Pages 3134-3138 2015 \ Website: www.ijsrm.in ISSN (e): 2321-3418 Real-Time Monitoring Framework for Parallel Processes

More information

PERFORMANCE MONITORING OF JAVA COMPONENT-ORIENTED DISTRIBUTED APPLICATIONS

PERFORMANCE MONITORING OF JAVA COMPONENT-ORIENTED DISTRIBUTED APPLICATIONS PERFORMANCE MONITORING OF JAVA COMPONENT-ORIENTED DISTRIBUTED APPLICATIONS Adrian Mos, John Murphy Performance Engineering Lab, Dublin City University Glasnevin, Dublin 9, Ireland Tel: +353 1 700-8762,

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information