
School of Mathematics and Systems Engineering Reports from MSI - Rapporter från MSI

A Java Founded LOIS-framework and the

Message Passing Interface?

- An Exploratory Case Study -

Master Thesis

Christian Strand


Abstract:

In this thesis project we have successfully added an MPI extension layer to the LOIS framework. The framework defines an infrastructure for executing and connecting continuous stream processing applications. The MPI extension delivers the same amount of stream-based data as the framework's original transport. We assert that an MPI-2 compatible implementation can be a candidate for extending the given framework with an adaptive and flexible communication sub-system. Adaptability is required since the communication subsystem has to be resilient to changes, either due to optimizations or system requirements.

Keywords: LOIS, MPI, Message Passing Interface and Java, MPI and Java, JNI, Java Native Interface, Component Systems

Acknowledgement

First of all I would like to thank my family for their support and their continual patience. Next, I would like to thank my supervisor, Professor Welf Löwe and the system administrator of the School of Mathematics and Systems Engineering.


Table of Contents

1 Introduction
1.1 Background and Motivation
1.2 Problem Statement
1.3 Goal criteria
1.4 Methodology
1.5 Limitations
1.6 Outline
2 Technologies
2.1 Message Passing and Computer systems
2.1.1 Cluster and Grid Computing
2.2 The LOIS Framework
2.2.1 Functionality
2.2.2 Infrastructure
2.3 The LogP machine model
2.4 The Java Native Interface
3 The Message Passing Interface
3.1 Fundamentals
3.2 Errors and error handling
3.3 MPI-2 extensions
3.4 Interesting MPI implementations
3.4.1 Adaptive MPI
3.4.2 Fault tolerant MPI
3.4.3 Grid MPI
3.4.4 Java MPI
3.4.5 Conclusion
4 Design
4.1 Development environment
4.2 Incorporating MPI into the framework
4.3 Incorporation realization
5 Experiments
5.1 Start up
5.2 Throughput and stability
5.3 The LogP parameters
5.3.1 Overhead
5.3.2 Gap
5.3.3 Latency
5.3.4 Error Discussion
5.3.5 Performance Discussion
6 Conclusion
7 Bibliography
Appendix A: Measurement data
A.1 Start up
A.2 Throughput
A.3 LogP-parameters
A.3.1 Overhead
A.3.2 Gap
A.3.3 Latency


1 Introduction

This chapter presents the background of the LOIS project and motivates the problem statement used in this thesis project.

1.1 Background and Motivation

Today it is possible to construct large arrays of geographically distributed heterogeneous sensors connected by a high-speed network. The objective behind these Wide Area Sensor Networks, WASN, is to obtain environmental data and to carry out scientific computations. Different research groups utilize such a WASN by submitting their applications. In order to keep up with the sensors' continuous data streams, these applications (ideally) have to be designed to consume streams in real-time. Yet, from a usability perspective, the installation and development phases have to be simple enough. A Service Oriented Architecture, SOA, should provide a good approach to modelling these applications as services, but is not appropriate for implementing them as parallel applications due to poor performance and architectural mismatches. The reason for this is that the sensors produce continuous data streams while SOA architectures usually request data. The DMDA, a Dynamic Service Architecture for Scientific Computing, outlined by [36] proposes a conceptual two-level WASN architecture combining a SOA level and a heterogeneous data-driven physical level. The former level provides a programming model that abstracts internal critical system parameters such as scheduling and optimization. Once built, the system should optimally allocate supplied applications to processors (see 1.5 Limitations). The latter level provides adaptability by allowing changes to the machine model at run-time. Adaptability is required since the machine model hosting the supplied applications will change at runtime. Further, the deployment phase has to be simple since researchers should not need to comprehend and efficiently manage all system parameters. Therefore, current research strives to provide graphical visualization of dynamic architectures in which designers can create architectures and execution scenarios.

As an early workbench, [5] has modularized an existing MFC (Microsoft Foundation Class Library) application with a predefined set of sensor dialogs. As defined in [36], a sensor network is a "set of sensors generating the input, a set of computational nodes transforming the input, and a set of interconnecting links transporting the data". The framework defines independent components for continuous stream processing as well as an infrastructure for executing and connecting components, but abstracts a number of hardware- and software-specific properties of the DMDA, such as scheduling.

1.2 Problem Statement

The aforementioned framework transmits IP datagram packets over the network through UDP, the User Datagram Protocol. My task is to extend this framework with a message-passing layer based on the Message Passing Interface, MPI. The goal is twofold. First, can the MPI extension deliver stream-based data sufficiently fast? Second, is an MPI extension a possible candidate for an adaptable physical machine model?


1.3 Goal criteria

Developing a general, maintainable and flexible MPI extension that supports adaptability is the foremost goal criterion. The goal is therefore not to re-implement the framework using MPI; the usage of MPI features is restricted solely to communication, not computation. The extension must permit any desirable execution scenario over the required processors while still being resilient to changes. Generality implies portability, and no feature shall be implemented under platform assumptions unless the assumptions are likely to hold on other platforms. Thus we cannot assume a specific MPI implementation or operating system.

1.4 Methodology

Besides delivering stream-based data, the MPI extension has to be evaluated in terms of its capability to adapt to run-time evolution. In the former case, delivering stream-based data is a bit special since there is no vendor implementation of the Message Passing Interface that supports a Java language binding. Usually, the language bindings supported by the specification are C, C++ or Fortran. Therefore the first question raised in the problem statement naturally leads to the question of how to integrate an MPI communication layer into the Java-founded framework. In the latter case, the capability to adapt to run-time evolution is a consequence of the DMDA, which requires that the physical communication subsystem is resilient to changes.

This is especially true since the scheduler, or a researcher at will, may change the current execution scenario, either due to optimizations deduced by the scheduler or when further applications are added by the researcher. With respect to this thesis project, adaptivity requires means to recover from failures in the message-passing layer and, on request, means to dynamically add or release applications during execution. Therefore the theory part of this project investigates a number of potential MPI implementations that support adaptivity, fault tolerance and a Java language binding.

As evaluation criteria it is interesting to establish the start-up time required by the extension, the delay (latency) and also the extension's throughput. Analyzing the efficiency of parallel algorithms or the performance of parallel machines requires models that abstract the details. In essence, one desires models that are both realistic and simple. Providing both a realistic and a simple model entails defining a balance between detail and simplicity in order to detect bottlenecks [15]. In the past, there have been various models that in some sense are unrealistic, impractical or tailored towards a specific network topology. Hence, in reality they sidestep the requirement of being realistic since they, for instance, assume that all processors work synchronously or that communication is free [15]. The LogP model, on the other hand, is both realistic and simple and is often used for parallel computations.

Hence, the LogP parameters will be used to evaluate the MPI extension. Once determined, the parameters constitute a means to compare different candidates for an adaptable physical machine model and to eventually optimize the runtime system.

1.5 Limitations

The limited time and scarce resources imply that existing and different MPI implementations cannot be installed and evaluated. Section 3.4 Interesting MPI implementations provides a rather theoretical view of interesting implementations followed by a (still theoretical) conclusion.

Once applications are deployed they should be scheduled by optimally allocating them to processors. This is possible since the execution scenarios and dependencies defined by the researcher can be described by task graphs. A task in this setting is the computation defined by the researcher. The optimal allocation can then be treated as an optimization problem in which the task graph should be executed as often as possible. However, the goals of this thesis project do not include the scheduler or the algorithm behind it, and it is therefore not presented here.

1.6 Outline

The rest of this thesis is organized as follows. Chapter 2 presents the LOIS framework and the technologies required in this thesis project. Chapter 3 is dedicated to the fundamentals of the Message Passing Interface, and Section 3.4 Interesting MPI implementations is dedicated to specific implementations providing adaptability and fault tolerance. Chapter 4 presents the design of the MPI extension while Chapter 5 presents and interprets the experiments. Finally, Chapter 6 concludes this thesis project. For the interested reader, Appendix A: Measurement data contains the data obtained by performing the experiments outlined in Chapter 5.


2 Technologies

As a prerequisite to Chapter 3 The Message Passing Interface, this chapter introduces the fundamentals of message-passing and computer systems. It also presents the LOIS framework and explains the LogP machine model. The chapter ends with a short explanation of the Java Native Interface.

2.1 Message Passing and Computer systems

Message-passing in the Single Program Multiple Data model, SPMD, of parallel programming generally refers to a set of cooperating processes communicating through message-passing. The processes execute the same program image on local data streams. Fundamental to this model is that all transmission of data is carried out collectively: a data transmission from one process to another requires that the operation is carried out on both processes. Variations exist, such as Multiple Program Multiple Data, MPMD, in which cooperating processes execute different program images on local (or the same) data streams. Message passing is not the only way to transport data between processes; data could also be moved via data-parallel languages such as High-Performance Fortran. Message-passing is according to [1] both expressive and suitable when processes are separately connected through a network. In particular, the expressiveness makes message-passing suitable for designing parallel algorithms such as instances of the Master-Slave design pattern. Message-passing can also increase performance and data locality, since data can explicitly be associated with specific processes.

The reason for using a set of cooperating processes in the first place is to obtain increased computational power. In order to obtain this, the SPMD model requires a (high-performance) computer system such as a supercomputer, a cluster or a grid system, though concrete computer systems differ both with respect to hardware architecture and to the most appropriate model supported. Clusters and grid technology are the most important technologies in this thesis; since the Message Passing Interface can be used (with advantages) on a supercomputer as well, the above discussion continues to hold. The next subsection provides a short introduction to clusters followed by grid computing.

2.1.1 Cluster and Grid Computing

One obtains a cluster by connecting possibly heterogeneous workstations (or networks thereof) together. Thus, compared to a supercomputer one obtains a cheap distributed virtual machine for parallel computing. However, by connecting heterogeneous workstations together one has to be aware of different binary formats, different data formats such as endianness, and different floating-point representations.

Heterogeneous machines may also differ in their performance and may have different load (e.g., if the machine is not dedicated to the computation) and so forth. While the notion of a virtual machine refers to providing a single system view, Grid computing refers to on-demand sharing of distributed computers and resources for collective problem solving. More formally, the concept could be defined as "coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations" [9]. The shared resources are direct access to computers, software, data or sensors rather than objects or files. The individuals or institutions determining which resources may be shared, and by whom, are referred to as a virtual organization, VO [9]. In order to avoid the recreation of an incompatible, non-interoperable distributed system, Grid computing requires that the protocols are open, general and standardized [9]. Standardized protocols are required in order to allow for the (on-demand) establishment of sharing relationships between arbitrary organizations. Besides being standardized, the protocols should be open and general in order to provide interoperability among heterogeneous virtual organizations.

The Global Grid Forum, GGF [8], has developed interfaces, common behaviors and fundamental semantics of Grid-based applications based on Web services (e.g., Grid Services) [10]. Current research endeavors to define and implement the Open Grid Services Architecture (OGSA) and the Open Grid Service Infrastructure (OGSI). Though Grid Services are outside the scope of this thesis, [13] has designed an interesting grid-enabled MPI implementation. The implementation is grid-enabled in the sense that it speaks Grid service protocols.

2.2 The LOIS Framework

As mentioned, the framework defines independent components for continuous stream processing as well as an infrastructure for executing and connecting components. The infrastructure is intended to be used together with a Mediator responsible for delegating data to an existing MFC C++ application, SensorGui, and to registered components. SensorGui is an existing (slightly modified) application with predefined sensor dialogs, from which the framework has been reengineered and modularized. Both the Mediator and SensorGui are executed on the same site, while registered components can execute either locally as threads or externally.

The data delegated by the Mediator consists of three antennas (channels) that constitute a LOIS antenna. The antenna data is conveyed and transmitted over the network in datagram packets through UDP, the User Datagram Protocol. Each datagram packet (e.g., a raw LOIS data portion) is numbered and contains three imaginary samples of each channel respectively. The framework contains the classes and structures required to decode and separate the channels.

2.2.1 Functionality

The components (or modules) building up the infrastructure define general interfaces and encapsulate behavior required by all components. A component in the framework is defined as an independent unit capable of providing continuous stream processing. The framework differentiates processing components from components building up the infrastructure. The components building up the infrastructure define general interfaces within LOIS intended to glue the independent components together. A processing component is intended to consume antenna streams in real-time; thus it constantly performs some useful processing on behalf of its supplier. To instantiate a processing component one has to parameterize these general components with the types required by the processing component itself (using Java generics).


2.2.2 Infrastructure

The middleware or infrastructure provides the jacket for executing and connecting independent components together. Referring to Figure 2.1, independent processing components are connected through the class Processor Jacket (jacket) and concrete instances of the interfaces Receiver, Sender and Data Converter. A processing component obtains data through the jacket. Prior to delivering data to the processing component, the jacket itself obtains data from a Receiver, and it forwards the processed result from the processing component via a Sender. A Receiver makes use of a Data Converter in order to convert received data if required by the processing component, such as conversion of a datagram packet into its equivalent software representation. A concrete Sender can delegate processed data to a set of concrete Receivers, either locally through shared buffer references or over the network. Thus, Processing Component, Receiver, Jacket and Sender operate as one united whole.
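As a concrete illustration, the following minimal Java sketch shows how these roles could fit together. The interface and class names are taken from the description above, but every signature and generic parameter is an assumption made for illustration only; this is not the actual LOIS framework API.

// Hypothetical sketch of the infrastructure roles described above; all signatures are assumed.
interface Receiver<R> {
    R receive() throws java.io.IOException;            // block until the next raw data item arrives
}

interface Sender<T> {
    void send(T processed) throws java.io.IOException; // delegate a processed result onwards
}

interface DataConverter<R, I> {
    I convert(R raw);                                   // e.g., decode a datagram packet
}

interface ProcessingComponent<I, O> {
    O process(I input);                                 // one step of continuous stream processing
}

// The jacket coordinates the flow: receive, convert, process, send.
class ProcessorJacket<R, I, O> implements Runnable {
    private final Receiver<R> receiver;
    private final DataConverter<R, I> converter;
    private final ProcessingComponent<I, O> component;
    private final Sender<O> sender;

    ProcessorJacket(Receiver<R> receiver, DataConverter<R, I> converter,
                    ProcessingComponent<I, O> component, Sender<O> sender) {
        this.receiver = receiver;
        this.converter = converter;
        this.component = component;
        this.sender = sender;
    }

    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                R raw = receiver.receive();
                sender.send(component.process(converter.convert(raw)));
            }
        } catch (java.io.IOException e) {
            Thread.currentThread().interrupt();          // the real framework would delegate the error
        }
    }
}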

The framework also provides convenience classes designed for datagram transmission of data, such as the creation of endpoints and remote connections; for each endpoint there are means to define the number of retransmissions, timeouts, buffer sizes etc.

Figure 2.1. The infrastructure: the circle depicts the Processor Jacket coordinating the data flow indicated by arrows. The application corresponds to a specific processing component.

2.3 The LogP machine model

The LogP model as initially defined by [15] models a distributed-memory multiprocessor in which P processors synchronize and communicate through point-to-point messages. The number of processors is finite, and a finite-capacity network interconnects them. The model specifies performance characteristics such as the maximum time or the maximum space required by a processor on the network. Note that the model does not pay attention to the network topology (i.e., all communication costs are equal regardless of the processors' positions, the size of the network and the network diameter). The processors have access to local memory, a clock and a network interface; thus the processors are responsible for all send and receive operations. The performance characteristics are determined by the three parameters L, o and g (illustrated by Figure 2.2). The overhead o is defined as the amount of time required by a processor to send or receive a message. During this time period the processor is unable to carry out other operations. Assuming consecutive message transmissions at a processor, the gap g is defined as the minimum time interval between either consecutive sends or consecutive receives. Thus it is a bound on the time required by the processor to send one and only one message into the network. If the gap is larger than the overhead, the processor can use the difference for computation. The latency L, or delay, is defined as the difference between the completion of a send and the start of the corresponding receive operation. If the receive operation has started while the send operation is still uncompleted (there are simply more bytes left to send), the latency parameter L will be negative. In reality this occurs when either large messages are sent or a too complicated protocol is used [17]. When determined, these parameters are meant to be interpreted as constants. As justified in Section 5.3 The LogP parameters, the parameters will in this thesis project be interpreted as functions of the message size.

Figure 2.2. The LogP parameters in consecutive message transmission (adapted from [26]).
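To make the parameters concrete, the point-to-point cost under LogP is usually written as follows. This is a sketch of the standard formulation found in the LogP literature (assuming g >= o); the exact expressions used for the measurements in Chapter 5 may differ.

T_1 = o_send + L + o_recv ≈ 2o + L          (one message between two processors)
T_k = (k - 1)·g + 2o + L                     (k consecutive, pipelined messages from one sender)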

2.4 The Java Native Interface

The Java Native Interface, JNI, supplies the means to integrate Java with code written in other languages. For example, a Java application can invoke C library functions not provided by the Java API itself. In this case, the C library function is compiled into host-specific binary code and linked through native libraries. The Java application uses appropriate methods to load the C library and invoke the function as any regular method. The host-specific code removes the portability of Java applications and also compromises the type-safety provided by the language itself. From the portability perspective, such a Java application can no longer be ported, compiled and executed on any platform supporting Java. The compromised type-safety means that (safe) references can no longer be used; instead all fields, methods, arguments etc. are accessed and manipulated in native code through opaque objects.


These opaque objects are hosted by the virtual machine itself and are only accessible by calling appropriate functions through the interface pointer. This interface pointer, referred to as JNIEnv, is a pointer to a function table of predefined functions.

The interface pointer is not enough when invoking native code from a Java application; the machine also needs information about which object instance the particular method is invoked on (or which class, in the case of a static method). The interface provides a mapping of regular Java data types to corresponding native types and a set of native-language types adhering to Java types (the interface still differentiates primitive types from reference types such as strings, classes etc.).
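As a small illustration, the Java side of a native call could look as follows. This is a generic JNI sketch and not code from the LOIS framework; the class, method and library names are invented for the example, and the corresponding C function would have to be compiled into the named shared library.

// Illustrative JNI usage; NativeMpiBridge, nativeSend and "mpibridge" are invented names.
class NativeMpiBridge {
    static {
        System.loadLibrary("mpibridge");    // loads libmpibridge.so (or mpibridge.dll)
    }

    // Declared in Java, implemented in C (under the JNI name Java_NativeMpiBridge_nativeSend);
    // invoked from Java like any regular method.
    native int nativeSend(byte[] buffer, int destinationRank);
}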

From the performance perspective, regular Java method invocations on object instances are faster than native invocations. This is because the machine has to carry out various housekeeping operations to build arguments, arrange the call stack etc. Estimates stated by [18] indicate a factor of two to three. The same indication applies to callback functions (e.g., native code invoking instance methods through an object reference). The cost of a field access is estimated as negligible, because dereferencing a pointer through the interface pointer requires only a small amount of overhead. Note that the interface provides many more features to enhance performance (for instance caching), but they are not presented here (see [18] for details).


3 The Message Passing Interface

The Message Passing Interface, MPI, is a specification of an application programming interface defined by the Message Passing Forum [7]. MPI is not a language or an implementation by itself. The Forum was founded in order to design a practical, portable, efficient and flexible standard for message-passing. A standard for message-passing was required in order to create and execute parallel algorithms on supercomputers or clusters; in the early 90s, libraries were instead tailored to each vendor's software and hardware. Besides portability, the standard facilitates efficient implementations of communication primitives possibly supported by hardware.

The first specification, MPI-1, provides a strictly message-passing model in which the number of processes is determined before starting an application. Recall that the message-passing model requires that all data movement operations are collective within the set of processes. Aspects such as parallel I/O, dynamic process creation and one-sided operations were deliberately omitted and introduced as extensions in MPI-2. The language bindings in the first specification were limited to C and Fortran 77, while the second specification also supports C++ and Fortran 90 modules. Besides additional language bindings, the second specification offers a number of features that make MPI more robust and convenient to use [2].

3.1 Fundamentals

Fundamental to MPI are processes, groups, communicators and data types. Briefly, each process belongs to one or more groups and is identified by an integer, referred to as the process's rank. After application start-up there is an initial group containing all processes, and an implementation provides numerous functions for group management. Instead of referencing processes by an integer, the specification also offers topologies: constructs that permit a logical connection of processes.

A communicator (or a distributed object) consists of a context for message matching and a group. The context provides the ability to make use of separate safe universes [6]; otherwise a specific implementation might mix up messages belonging to different universes, or force the programmer to cope with them. The programmer makes use of the communicator in order to communicate with other processes in the group. There are two kinds of communicators available depending on the communication context: intra-communicators are used for operations within a single group of processes, while inter-communicators are used for point-to-point communication between two groups of processes. Hence, inter-communicators are very important in an MPI-2 compatible implementation, since they abstract the concept of two separately started groups of processes.

Recall that a cluster provides a single system view of processes and that processes communicate through message-passing. MPI provides many different type constructors to allow both for expressiveness and for heterogeneity. In fact, it is the responsibility of an MPI implementation to arrange and adapt to different processor architectures, endianness and so forth; thus MPI is a parallel middleware. For instance, derived data types allow one to treat contiguous, strided or indexed segments of arrays as individual elements. This is very useful, since one can send all the data in one function call.
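As an illustration of these fundamentals, a minimal SPMD program in the style of the mpiJava binding (discussed in Section 3.4.4) could look as follows. The method signatures are those of the mpiJava 1.2 API as we understand it and should be treated as a sketch rather than a definitive listing.

import mpi.MPI;
import mpi.MPIException;

// Every process executes the same program image; rank 0 sends an array to rank 1.
class PingSketch {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();     // this process' identity within the initial group
        int[] data = new int[3];

        if (rank == 0) {
            data[0] = 1; data[1] = 2; data[2] = 3;
            MPI.COMM_WORLD.Send(data, 0, data.length, MPI.INT, 1, 99);   // destination rank 1, tag 99
        } else if (rank == 1) {
            MPI.COMM_WORLD.Recv(data, 0, data.length, MPI.INT, 0, 99);   // source rank 0, tag 99
        }
        MPI.Finalize();
    }
}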


MPI provides different modes of point-to-point communication primitives between two processes as well as collective operations among a set of processes. The modes of point-to-point communication are standard, buffered, synchronous and ready. The modes differ with respect to being blocking or non-blocking (asynchronous) and to when a send operation completes (which in turn depends on whether or not a matching receive operation is posted/required). A send operation in (standard) blocking mode completes (returns) when the message is safely stored in the matching receive buffer (or a local system buffer). An asynchronous send operation completes immediately and requires another call to check for completeness; thus one or more data transfers and local computations can proceed concurrently. A buffered send operation can complete whether or not there is a matching receive, whereas a synchronous send completes only when a matching receive operation has started receiving. Finally, a ready send operation can only complete when the matching receive operation has been posted. This mode can improve performance since the implementation is not obligated to verify a matching receive operation; if a matching receive is not posted the operation is erroneous. Further, blocking and non-blocking operations can be mixed.
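A non-blocking variant in the same sketched mpiJava style could look like this; the signatures are again assumptions based on the mpiJava 1.2 API.

import mpi.Comm;
import mpi.MPI;
import mpi.MPIException;
import mpi.Request;

class OverlapSketch {
    // Assumes MPI.Init has already been called by the surrounding program.
    static void sendWhileComputing(Comm comm, int dest) throws MPIException {
        int[] data = new int[1024];
        Request request = comm.Isend(data, 0, data.length, MPI.INT, dest, 0);
        // ... local computation can proceed here while the transfer is in progress ...
        request.Wait();   // completes the send; the buffer may then safely be reused
    }
}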

The collective operations supported by an implementation are of two kinds: data movement and collective computation operations. Data movement operations are useful for restructuring data among processes, such as broadcast and different forms of scatter and gather operations. A collective computation like sum carries out a sum computation over all processes referenced by a communicator. From a performance perspective an MPI implementation can take advantage of its knowledge about each machine in order to optimize and increase the parallelism.
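In the same sketched style, a broadcast and a sum reduction over all processes of a communicator could look as follows (signatures again assumed from the mpiJava 1.2 API).

import mpi.MPI;
import mpi.MPIException;

class CollectiveSketch {
    static void broadcastAndSum(int myValue) throws MPIException {
        int root = 0;
        int[] config = new int[4];
        // Data movement: the root's buffer contents are copied to every process in the group.
        MPI.COMM_WORLD.Bcast(config, 0, config.length, MPI.INT, root);

        int[] local = { myValue };
        int[] total = new int[1];
        // Collective computation: the sum over all processes' values ends up at the root.
        MPI.COMM_WORLD.Reduce(local, 0, total, 0, 1, MPI.INT, MPI.SUM, root);
    }
}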

3.2 Errors and error handling

An MPI implementation has, according to the specification, to provide reliable message-passing to its applications. The application should not be required to detect and deal with faults and errors at the network level. If a fault is unrecoverable it is delegated to the application through error handlers or error codes (see below). Thus, the application is forced to take appropriate actions in case of unrecoverable errors.

Besides network-level faults, the application itself can cause errors, for instance by calling a function wrongly, supplying an erroneous destination rank, or providing too small receive buffers. A process could also exceed or abuse system resources, such as exceeding buffers allocated for pending messages. If a fault occurs during a function call an error handler will be invoked. Which error handler is invoked depends on which communicator the call was performed on. To each communicator, MPI associates either a built-in or a user-defined error handler. The built-in ones are MPI_ERRORS_ARE_FATAL and MPI_ERRORS_RETURN respectively. The former implies that if a fault occurs during a function call then all processes shall be aborted, while the latter makes MPI functions return error codes instead. Being able to write user-defined error handlers is important in order to write fault-tolerant applications, MPI-based libraries etc. [19]. Naturally, there are a number of factors that limit the ability of an MPI implementation to return meaningful error codes. Some errors cannot be detected, others are too expensive to detect during normal execution, while others are catastrophic and occur internally in an implementation. Besides, an asynchronous operation may very well judge an operation as successful though it later causes a fault. Further, the standard says nothing about failing processes and as such provides no mechanism for recovering from failed processes. In fact, the specification does not define the state of an implementation after a fault has occurred.
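With a Java binding such as mpiJava, errors that an implementation chooses to report rather than abort on surface as exceptions. Whether this actually happens depends on the effective error handler and on the underlying implementation; the following is therefore only a hedged sketch.

import mpi.MPI;
import mpi.MPIException;

class ErrorHandlingSketch {
    static void trySend(int[] buffer, int dest) {
        try {
            MPI.COMM_WORLD.Send(buffer, 0, buffer.length, MPI.INT, dest, 0);
        } catch (MPIException e) {
            // Reached only if errors are returned instead of aborting all processes
            // (MPI_ERRORS_RETURN semantics); the application must then decide how to
            // proceed, e.g., retry, reconfigure or shut down.
            System.err.println("MPI send failed: " + e.getMessage());
        }
    }
}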

3.3 MPI-2 extensions

As already stated, parallel I/O, dynamic process creation and one-sided operations were deliberately omitted from MPI-1 and introduced as extensions in MPI-2. The purpose of this section is to briefly introduce these concepts.

Threading

During the design of the first specification there was no clear thread interface. The forum was aware of threads and designed the specification to be thread safe, but did not require thread support. The MPI-2 extension, on the other hand, provides inquiry operations for the thread-safety level provided by an implementation (e.g., for portability). In an implementation that supports threads, all threads are permitted to invoke MPI functions, but they are not addressable by other processes. From the programmability perspective, the programmer is responsible for preventing race conditions caused by conflicting communication calls. A typical workaround is to make use of distinct communicators for each thread [2].

Dynamic Process Management

Dynamic Process Management allows the static communicator to grow, i.e., the forum added functions to create new processes during execution or to connect separately started MPI applications together. The management of dynamic processes has been implemented as collective operations in order to guarantee correctness and determinism. Creating new processes during execution is carried out through collective operations that create new processes executing either the same program image or different program images (i.e., SPMD and MPMD respectively). Where to launch processes and where program images are located is supplied through a special info argument. Note that the content and structure of the info argument are implementation specific, though the semantics of some special keys, such as the host and architecture, have been defined by the specification.

Connecting two independent applications together is carried out through MPI functions similar to the client-server paradigm. Note that in MPI the client and the server are generally sets of processes. An implementation offers the mechanism to find the server, but an application has to define its own logical connection point (i.e., where to rendezvous). The client can locate the server, for instance, by a well-known port and address pair or through a name server. However, many of these features are also implementation dependent and will not be presented here.

Parallel I/O

As defined in [2], parallel I/O is access to external devices, such as files, from a set of processes. Parallel I/O can in general be thought of as UNIX I/O utilities implemented with MPI functions [2]; that is, MPI supports operations (almost) similar to open, close, read, write etc. For instance, an MPI application can with this extension implement a parallel copy of files. With respect to this project we see no important use of this extension and refer to [2] for details.

Remote Memory Operations

Remote Memory Operations refers to means to cause events, such as signals or a remote memory copy, to occur in the address space of another process, yet without requiring that both processes carry out the same logical operation. Thus remote memory operations provide a mechanism similar to a shared-memory model. The specification defines three functions for writing, reading and updating a remote address space, as well as a number of different synchronization primitives.

3.4 Interesting MPI implementations

Currently, there are many different implementations of the specification. Some are implemented as software packages for a range of platforms while others are implemented directly in hardware and are as such platform specific. Others are designed and optimized for supercomputers or clusters; in the case of supercomputers, vendors usually provide optimized implementations. Some are commercial while others are publicly available. The rest of this section introduces a number of implementations that are special in some respect to our extension.

3.4.1 Adaptive MPI

In an MPI-1 compatible implementation the communicator consists of the fixed set of processes allocated and created during start-up. Since the communicator is fixed, it is impossible to add more processes. Generally this static communicator implies that when a process fails the communicator fails and the application has to be aborted. A static communicator might be tolerable on dedicated supercomputers with a predefined set of processes, but not in grid environments or clusters where errors and faults are much more likely.

Adaptive-MPI

Adaptive-MPI, though not available and described only at a rather high level, provides many useful ideas. It is stated as an adaptive MPI implementation designed with the ability to adapt itself to the unpredictable and dynamic nature of a Clustered-Grid environment [20]. With respect to this thesis project, adaptivity requires means to recover from process failures and, on request, means to dynamically add or release processes during execution. Hence, the Adaptive-MPI implementation provides support for adaptivity.

Functionality

In order to support adaptation the implementation makes use of two special components: a watchdog and an external resource manager. The resource manager is responsible for determining when a processor is no longer available and for locating new processors. If a processor is determined to be unavailable, or if new processors become available, the manager informs the watchdog through (predefined) events. How the resource manager determines availability is unspecified and determined by the application. Once notified, the watchdog informs the application of the event. The application in turn is responsible for detecting the event as well as for taking appropriate action, such as distribution of application-specific data [20]. However, the implementation will create new processes and update the communicator, for instance adding or removing the ranks of processes from the communicator and other internal data structures.

Performance

From the perspective of this thesis project, the performance measurements carried out by [20] indicate that the ability to (opportunistically) add or delete processors and processes during execution implies no significant overhead inside the system itself (see [20] for test environment and setup). This applies both to an already running system and to the case where the implementation itself has to be started on new processors and internal structures such as the communicator have to be updated. It should be noted that the update procedure can be expensive since it requires network communication. The same applies to the computation: it can be expensive to maintain and delegate data to newly started processes, though this is of course dependent on the computation.

AMPI

AMPI [21] is another adaptive MPI implementation, designed by the Parallel Programming Laboratory at the University of Illinois. The adaptation supported by AMPI refers to maximizing throughput in dynamic applications executed on heterogeneous clusters (with a static set of processes [21]), rather than rebuilding internal structures as in Adaptive-MPI. These dynamic applications are applications with irregular structure and dynamic load patterns due to a heterogeneous (clustered) environment. Such shifts concern background load from other users, processors only being available when the computer is not being used by its primary user, new faster processors being added, etc.

Currently, the implementation makes use of the parallel C++ library Charm++ together with AMPI in order to provide processor virtualization and communication optimizations. Processor virtualization is an intelligent mapping of divided entities onto available processors [21]. This technique makes it possible to incorporate load-balancing techniques at the system level rather than accounting for them at the application level. Generally, incorporating load-balancing techniques at the application level can be a very cumbersome process [21]. Charm++ supports dynamic load-balancing using object migration. Because objects operate on well-defined memory spaces they are easy to migrate, and because they are easy to migrate they can be moved at a relatively cheap cost [21]. Through the use of Charm++'s object model the run-time system can measure the load on particular objects instead of measuring load through application-specific heuristics [21]. Besides measuring the run-time load of objects, the run-time system pays attention to object-to-object communication patterns so that the communication impact of migrating particular objects can be established.

Since our extension explores the possibility of adding an MPI communication layer we are assuming dedicated machines; thus we do not expect background load due to secondary users. Although outside the scope of this thesis project, AMPI provides interesting migration and virtualization techniques worth exploring.


3.4.2 Fault tolerant MPI

Related to adaptability is fault tolerance. Again, the required level of fault tolerance is related to the underlying machine model (i.e., processes executing on dedicated reliable hardware versus a Grid-clustered environment). An application running on a Grid-clustered environment requires mechanisms for error detection as well as information (state) for restarting processes. In the best of worlds there are mechanisms for detecting and surviving all kinds of faults. There exist implementations that automatically survive some kinds of faults, while others notify the application and the application itself takes appropriate actions. Other implementations invalidate some, though not all, operations; the application itself then arranges for the non-failing processes to retain enough information (e.g., state) held by the failed process for the overall computation to proceed.

MPICH-V and MPI-FT

MPICH-V (here abbreviated V) [22] is a research implementation that makes use of MPICH-1.2 (an old implementation) to provide automatic fault tolerance through the use of complete checkpointing and message logging. The implementation can re-establish aborted processes. The logging facility implies that there is no need to recreate computations from a checkpoint. Checkpointing is a common technique in which a computation's state is periodically saved on stable storage; these states are used to restart a failed computation. V is scalable but requires a reliable subsystem for checkpoints and logging as well as for coordinating processes. The cost of this doubles the communication time but provides full recovery in all situations [19].

Since our use of MPI features is restricted to communication we see no use for this implementation and will not present it further. MPI-FT, on the other hand, provides fault tolerance in a way similar to V. The fault tolerance is provided by extending/modifying the semantics of MPI function calls; by extending the semantics of function calls the application is given the possibility to recover from failed processes. The application must by itself recreate and distribute data to crashed processes (the implementation re-spawns processes automatically). When a fault occurs, the implementation notifies all processes belonging to the communicator about the event through error codes. According to MPI-FT's implementers, the implementation survives the crash of all processes but one, and because no logs or checkpoints are used it causes no notable performance decrease.

As pointed out by [19], the specification itself provides relatively simple means to obtain some degree of fault tolerance. By considering fault tolerance as a property of an MPI application one can obtain the survivability usually provided by the server-client paradigm [19]: if the client goes down, the server can just stop communicating with it. By making use of an inter-communicator (and error handlers) one obtains two groups of processes (e.g., one group corresponds to the server while the other corresponds to the client). Obviously this requires the use of at least three communicators, of which one is intended as the link between the two groups.

3.4.3 Grid MPI

MPICH-G2 (here abbreviated G2) [13] is a grid-enabled implementation based on the MPICH implementation. Like its predecessor MPICH-G, it makes use of Globus Toolkit (Globus) services in order to execute MPI applications. In particular, the Globus services provide user authentication and authorization, executable staging, process creation and monitoring in a heterogeneous grid environment. The implementation provides one single communicator that hides details concerning network topologies, different software systems and computer architectures. G2 provides the best possible form of communication by grouping processes according to their network level. For instance, processes residing on a supercomputer or a local cluster communicate through MPI primitives, while processes separated by a WAN communicate through Globus IO (TCP). Besides choosing the best form of communication, the different groups imply the possibility of topology-aware collective operations and topology-discovery mechanisms. Topology discovery and topology-aware collective operations imply that (sub-)computations can be implemented in order to avoid expensive communication across different groups. Group information can be obtained by querying communicators for predefined attributes.

Functionality

Before an application can be launched, the user is required to obtain a proxy containing the user's credentials. This is necessary because Globus makes use of a public-key infrastructure, PKI. G2 uses this proxy in order to perform user authentication on each (grid) site the application spans. Besides obtaining the proxy, the user is required to specify an RSL (Resource Specification Language) script for each resource the application makes use of. The script identifies a computer and specifies requirements such as amount of memory, execution time and number of CPUs, as well as executable placement, environment variables etc.

After the user has obtained the proxy and written the scripts (actually a job description), the application is launched through the mpirun command. In brief, the scripts are used by DUROC, the Dynamically-Updated Request Online Coallocator, in order to allocate and schedule a job; see [13] for details. DUROC will generate a GRAM request for each computer and contact the respective GRAM server. The user is authenticated and the local scheduler is contacted in order to initiate the computation. GRAM, Grid Resource Allocation and Management, is a protocol used for launching and monitoring a job. DUROC makes use of a barrier in order to coordinate the start-up process. This is possible since each GRAM server reports its process's local state, for instance running. The monitoring of a process is carried out in a similar fashion.

When the processes have been started, the job description is used to group processes according to their network level, so-called multilevel clustering. This is done by assigning each process a topology depth corresponding to its network level (all processes belong to an initial communicator). For instance, processes communicating through TCP are assigned the topology depth 3 while processes communicating through a vendor MPI are assigned the depth 4, etc. (see [13] for details). The groups are thereafter formed by assigning the same color to the processes with the same topology depth. A similar process could be used in a regular MPI application; the difference is that here it is carried out by the implementation itself.
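The grouping by color mentioned above corresponds to the standard communicator-splitting operation, which is also exposed by Java bindings such as mpiJava. A sketch, with signatures assumed from the mpiJava 1.2 API:

import mpi.Intracomm;
import mpi.MPI;
import mpi.MPIException;

class TopologyGroupingSketch {
    // Processes supplying the same color (here, the same topology depth) end up in the
    // same new communicator; the key determines the rank order within each group.
    static Intracomm groupByDepth(int topologyDepth) throws MPIException {
        return MPI.COMM_WORLD.Split(topologyDepth /* color */, MPI.COMM_WORLD.Rank() /* key */);
    }
}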


Performance

According to experiments carried out by [13], the G2 implementation provides roughly the same performance as other implementations when using a vendor implementation. On a vendor MPI the performance differs depending on the context in which the function MPI_Recv (i.e., blocking receive) is called: whether requests are outstanding or not, and whether MPI_ANY_SOURCE is used. If there are no pending requests and all messages are sent directly between processes, the implementation uses minimal polling of TCP ports. Otherwise, the implementation is forced to make use of polling in order to delegate received messages to a process.

Issues

G2 is a research project designed to investigate how well MPI constructs are suitable for a grid environment. According to [13] there are a number of missing features, such as dynamic process management and fault tolerance. Dynamic process management would enable a greater class of grid applications, especially the means to implement applications where the applications' requirements and resource availability are dynamic, so-called advance reservations [13]. Note that G2 implements the connect feature described in Section 3.3 MPI-2 extensions, a feature that allows separately started MPI applications to be connected together.

3.4.4 Java MPI

With the success of the Java programming language, efforts have been made to realize a Java language binding of the MPI specification. In the past (the 1990s) there have been a number of research projects that endeavored to implement a true object-oriented model of MPI or partial solutions considered natural to the Java language itself (for instance integration with Java threads) [14]. Today, most of these projects are either closed or have no officially available implementation, except for mpiJava. Though there are (or have been) other implementations, we will only consider mpiJava since it has both documentation and an available implementation.

mpiJava

mpiJava 1.2.5 [14] does not by itself implement the MPI specification. Instead it provides an object-oriented MPI interface to a Java application. This interface makes use of a set of JNI (Java Native Interface) wrappers in order to invoke MPI functions. The latest implementation works with (almost) any version 1.1 compatible MPI implementation, for instance MPICH or LAM-MPI. The wrappers (or stubs) are loaded by the interface on behalf of an ordinary Java application (making MPI function calls). Since the framework is implemented in Java, this implementation constitutes a very interesting possibility for incorporating MPI into the framework.

Functionality

The implementation does not require any special preparation before launching an application, other than the use of scripts. The above-mentioned wrappers are loaded during launch by the implementation itself. Thus, from the perspective of an application, the wrappers are transparent and MPI functions are invoked through usual instance methods. Examples include the class Comm that represents an MPI communicator. The script referred to as prunjava is executed to launch applications and will call the underlying implementation's proper start-up mechanism. By supplying the number N of processes to run and the compiled class file X, prunjava will start up N Java processes executing class X. Since a process in mpiJava constitutes an ordinary Java application, the implementation supports both the SPMD and the MPMD programming model.

The mpiJava implementers have restricted the use of Java threads since many version 1.1 compatible implementations of the specification are not thread safe [14]. In particular, it is not permitted to communicate between threads through MPI functions, nor is it allowed to concurrently invoke MPI functions in different threads. Note that this does not imply that two or more threads cannot use MPI functions; it simply means that it is the programmer's responsibility to guarantee mutually exclusive invocations.

One of the advantages of mpiJava is that java.lang.Object is a predefined data type. This implies that MPI function buffers can be arrays of any object implementing the interface Serializable. Besides automatic serialization and de-serialization, this allows very flexible and powerful derived data types.
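For example, with the MPI.OBJECT datatype a buffer of serializable Java objects can be transmitted directly. The sample type below is invented for illustration; the call signature is assumed from the mpiJava 1.2 API.

import java.io.Serializable;
import mpi.MPI;
import mpi.MPIException;

class ObjectSendSketch {
    // Hypothetical sample type; any Serializable class would do.
    static class AntennaSample implements Serializable {
        double re, im;
    }

    static void sendSamples(AntennaSample[] samples, int dest) throws MPIException {
        // mpiJava serializes and de-serializes the objects automatically.
        MPI.COMM_WORLD.Send(samples, 0, samples.length, MPI.OBJECT, dest, 7);
    }
}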

Performance

The implementers of mpiJava have measured performance with respect to inter-processor communication on a number of (rather) old implementations. The performance measurements have been carried out through ping-pong tests in which messages of increasing size are sent back and forth and the one-trip time is measured. Thus the tests measure both one-sided latency and throughput (see [24] for details of the test environment and setup). Briefly, both in a shared-memory and a distributed-memory realization they found that mpiJava adds a constant overhead compared to regular MPI programs due to the performance of the JVM. With some implementations the overhead was rather large. As stated in their report, they compare C programs and Java programs, which to some extent can explain the differences in realized overhead.

Issues

The Java Virtual Machine, JVM, has complicated the implementation. In particular, the implementers have documented problems concerning the interaction of the JVM and MPI, especially OS-specific signals and the mapping of Java types to native (e.g., C) types affecting the semantics of MPI calls. In the case of signals and signal handlers it may happen that MPI handles signals intended for the JVM and vice versa. These problems should be resolved in mpiJava 1.2.5 [14]. Further, the implementers' own opinion is that interfacing Java and MPI through the use of JNI wrappers "is probably not a good approach". This probably explains why we have not been able to find new information about mpiJava, either on their homepage or on the web, dating after 2003.

3.4.5 Conclusion

Recall that our extension has to be judged from its capability to adapt to run-time evolution; the extension has to provide both adaptivity and flexibility. Therefore this section concludes the discussion of interesting MPI implementations. Once more, we point out that we did not have the time and resources to evaluate the above implementations, nor is our list of implementations complete.

We assert that an MPI-2 compatible implementation can be used to extend the current framework with an adaptive and flexible communication sub-system. An MPI-1 compatible extension would not provide the required level of flexibility, since the communicator is not permitted to change after run-time initialization; thus adaptation to user or system needs is impossible. In particular, we see dynamic process management, remote memory operations and threading as required and wanted aspects, of which dynamic process management is the most important.

The threading possibility of an MPI-2 compatible implementation should provide the design convenience required to implement different roles as threads, such as specific threads dealing with incoming events like shutdown or reconfigure, updates of intended recipients and senders, etc. The remote memory mechanism could be used to conveniently push asynchronous events such as those mentioned above. The support for dynamic process management would allow for a dynamic communicator in which separately started applications can be represented, while parallel I/O could be useful for transmitting data required by a new process to engage in the communication subsystem.

We consider the Adaptive-MPI approach possible to implement and evaluate through the use of the MPI-2 extensions. In particular, the MPI-2 extensions provide the necessary tools to implement something similar to the watchdog and the resource manager. Recall that the resource manager is an external component responsible for determining when a processor is no longer available. A component like this could be implemented as the component responsible for monitoring and as the event dispatcher (though assuming central control). We assume that monitoring is required both from the perspective of the communication system itself and from the perspective of the researcher, since the communication sub-system has to have the ability to recover from potential deadlocks and the researcher has to be able to confirm the progress of supplied applications. The simple fault-tolerance mechanism (Section 3.4.2 Fault tolerant MPI) could be used to detect failed processes but does not provide monitoring of processes. Without the use of the implementation itself, there is no longer a watchdog process responsible for dealing with events; thus updates of the communicator have to be carried out through messaging, which we assume could mean a substantial loss of performance. Put differently, with the use of an implementation like Adaptive-MPI, it is not required to send a message to every recipient, since the implementation carries this out for us.

Through the concept of a virtual organization, the G2 implementation should provide both authentication and authorization, which is useful in a full-fledged WASN but not with respect to this thesis project. Through the use of topology-aware communicators we should have the possibility (though assuming a rather large networked system) to optimize communication and organize processes during start-up. Further, the G2 implementation implies no restrictions with respect to scalability.

In addition, this implementation supports monitoring of processes and thus also detection of failed processes. However, any extension implemented using this implementation would not provide the ability to add applications as time goes on; the supplied applications would then have to be determined before starting up the system. Security aside, the same mechanisms could be imitated with the support of the MPI-2 extensions. As stated in Section 3.4.2 Fault tolerant MPI, the grouping facility could be imitated through the use of separate communicators, though in comparison they would not provide the same scalability as G2. The simple fault-tolerance mechanism should provide us with the ability to detect failed processes.

The tagging facility (when sending messages) could be used to define a protocol for updating recipients or senders. However, G2 is a research project evaluating the grid concepts and protocols, which in the end are required for interoperable problem solutions such as the implementation of a complete Wide Area Sensor Network.

Implementing the extensions on top of MPI-FT should extend the possibilities for adaptivity and performance, since it provides automatic fault tolerance and is MPI-2 compatible. It thus automatically gives us the ability to detect failed processes through returned error codes, and is in that sense comparable to the simple fault-tolerance mechanism mentioned above. This automatic behaviour also implies that the communicators are updated for us, minimizing the protocol required to inform every other potential process about a non-existing process (assuming that each process learns the identity of the failed process; otherwise some process will send to a non-existing process). Note that this solution still requires a component similar to the watchdog above, responsible for monitoring and event propagation.
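
As an illustration of error-code based detection in general MPI terms (a sketch under the assumption that the implementation returns, rather than aborts on, communication errors), a process can install the MPI_ERRORS_RETURN handler and inspect return codes:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rc, rank;
        char pkg[2048];
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Ask the implementation to return error codes instead of aborting. */
        MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        rc = MPI_Recv(pkg, sizeof pkg, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                      MPI_COMM_WORLD, &st);
        if (rc != MPI_SUCCESS)
            fprintf(stderr, "rank %d: communication failure, code %d\n", rank, rc);

        MPI_Finalize();
        return 0;
    }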

As outlined above, many of the features, such as automatic fault tolerance, can be imitated with the MPI-2 extensions. We therefore consider a general adaptable solution implementable, provided that the underlying MPI implementation supports these extensions. The other solutions would only simplify aspects such as the required set of protocols and the automatic update of the communicator. Yet there are many unanswered questions, such as the expected performance degradation and the reliability of the dynamic process management operations (which may differ between a clustered-grid environment and a supercomputer). For instance, since these operations are collective (i.e., processes synchronize before the computation proceeds), we expect performance degradation. Likewise, implementing the monitoring facility and supporting events requires more than just observing process states in the scheduler; it requires a dedicated component in its own right.

Since we consider an overall solution implementable, we restrict the rest of this thesis to assessing how to incorporate MPI into the current Java-founded framework.

Thus, we will consider mpiJava as well, even though it does not implement any of the MPI-2 extensions (but does provide a Java language binding).


4 Design

This chapter presents our development environment and the design of a number of potential concrete extensions.

The purpose of the suggested extension is to extend the framework's infrastructure so that MPI can be used to transport data between processing components executing either on a supercomputer or in a clustered-grid environment, thus assuming either a shared memory based or a distributed memory based machine model. The applications, i.e., the Processing Components, are still Java based. Recall from Section 2.2.2 Infrastructure that the Processor Jacket class and the interfaces Sender, Receiver and Data Converter define the framework's component infrastructure. Referring to Figure 4.1, the extension implies providing concrete implementations of the Sender, Receiver and Multiple Sender interfaces respectively.

4.1 Development environment

The development environment consists of eight homogeneous UltraSPARC-IIi (333 MHz, 256 MB RAM) computers running Sun OS, connected by a 100 Mbit/s Ethernet in the lab room hosted by the University, plus my own laptop computer hosting the Mediator and SensorGui. The reason for using the laptop is that the SensorGui application is Windows service based; for convenience we decided not to port it to our development environment. We chose the Local Area Multicomputer (LAM-MPI) implementation because it was already installed in our development environment.

Figure 4.1: the extension component infrastructure. The dashed rectangles denote the new desired components (adapted from [5]).

The LAM (6.3.2) environment consists of the lamd process and the library libmpi on each machine constituting a universe. A LAM universe consists of all machines with the capability of executing MPI processes, defined in a boot schema file (e.g., a list of machines). The lamd processes are launched with a three-step auxiliary program, lamboot. First, lamboot starts one process for each machine defined in the boot schema file. Second, each machine allocates and reports back its assigned port. Third, all machines receive a list of machine-to-port mappings. Thus each lamd process knows the details of the universe.

An application is started by the auxiliary program mpirun. When launched, mpirun pre-allocates all processes during initialization, since all processes have to be able to locate each other. During start up, each process inherits the process environment of its local lamd process; this environment is fixed after invoking lamboot. The lamd processes themselves inherit the process environment from their start-up shell. Thus a process executing locally inherits the user's shell environment, whereas a remote process inherits the environment of the remote shell program (typically rsh or ssh).
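
As a concrete illustration (hostnames are hypothetical and the exact option set differs between LAM versions), a boot schema file is simply a list of machines, and the universe is booted from it with lamboot:

    # bhost.def -- boot schema: one machine per line (hypothetical hostnames)
    node1
    node2
    node3
    node4

    $ lamboot -v bhost.def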

4.2 Incorporating MPI into the framework

The first problem outlined was how to incorporate MPI into the framework.

Incorporating MPI entails following the process model provided by the MPI specification. Our use of the LAM process model entails a coordinated start up of programs adhering to either the SPMD or the MPMD programming model. Recall from Section 2.1 Message Passing and Computer systems that a program conforming to the SPMD model executes the same program image on locally different data streams, while the MPMD model permits both different images and different data streams. Thus we had to decide whether to use one process or two processes implementing the functionality of the Sender and Receiver interfaces respectively.
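
As a minimal illustration of the SPMD style (our own sketch, not framework code), a single program image can branch on its rank so that one process takes the sending role and another the receiving role:

    #include <mpi.h>

    /* Sketch only: one SPMD image, two roles selected by rank. */
    int main(int argc, char *argv[])
    {
        int rank;
        char buf[16] = "LOIS";
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)            /* sending role */
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)       /* receiving role */
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);

        MPI_Finalize();
        return 0;
    }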

The LAM process model also forces us either to distinguish a Java process hosting the framework's infrastructure from separately started LAM/MPI processes, or to implement the extension entirely with mpiJava. Implementing the extension with mpiJava has the advantage of not requiring a separation of MPI processes from the framework, since the framework itself is allowed to invoke MPI functions.

4.3 Incorporation realization

As stated above, we had to decide whether to use one process or two processes implementing the functionality of the Sender and Receiver interfaces respectively.

Due to granularity, we decided to make use of separate processes, since there is no need for a processing component to be able to send data if it is only intended to receive. Further, since we assume an arbitrary implementation of the MPI specification, we have to be aware of its use of buffering. Internal buffers can increase performance, but the amount of buffering capacity differs from implementation to implementation. Assuming general protocols, there is a possibility of running out of internal buffers, since an implementation has to be able to match any message. Likewise, there is a potential risk of deadlock if two processes (e.g. the Sender and the Receiver) are both blocked because buffer space is not available.
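
The classic illustration of this risk (a generic sketch, not framework code) is two processes that both issue a blocking standard-mode send before posting their receives; once the messages no longer fit in the implementation's internal buffers, neither send returns:

    #include <mpi.h>

    /* Sketch: unsafe exchange that deadlocks when the messages exceed
     * the implementation's internal buffer capacity. */
    int main(int argc, char *argv[])
    {
        static double buf_out[100000], buf_in[100000];
        int rank, other;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        other = 1 - rank;   /* assumes exactly two processes */

        /* Both ranks send first: may block forever without buffering. */
        MPI_Send(buf_out, 100000, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
        MPI_Recv(buf_in, 100000, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &st);

        MPI_Finalize();
        return 0;
    }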

We also had to decide how to implement a connector. A specific connector implementation denotes the concrete communication mechanism required in order to delegate received or sent data between a Java thread and a LAM/MPI process. Referring to Figure 4.2, we realized that a socket connector and a shared memory based connector were both possible and interesting. We wanted to explore a shared memory based connector for two reasons. First, there exist both shared memory and semaphore implementations (i.e., synchronization) realized completely in main memory, and we consider a purely memory based shared memory implementation fast. Second, JNI provides a possible implementation approach worth exploring.

Though we have stated that the extension should not rely on platform specific features unless these are likely to exist on other platforms, making use of JNI implies giving up Java's portability. Yet we judge a shared memory implementation as one that is likely to exist on platforms other than ours.

Socket Connector

Implementing a socket connector was straightforward. The Java Receiver operates as a server, i.e., it listens for incoming data supplied by a LAM process, which in turn receives the data through MPI function calls from either the Mediator or a Sender. The antenna data is delegated through either TCP or UDP. As stated above, the framework and the LAM processes are started separately. For a Java Sender the server-client roles are reversed.
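
A minimal sketch of the LAM-side bridge process is given below; the port number, package size and localhost address are our own assumptions, and error handling is omitted. The process receives antenna data through MPI and forwards it over TCP to the Java Receiver:

    #include <mpi.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    #define PKG_SIZE 2048   /* assumed LOIS package size, illustrative only */
    #define PORT     9000   /* hypothetical port of the Java Receiver       */

    int main(int argc, char *argv[])
    {
        char pkg[PKG_SIZE];
        MPI_Status st;
        struct sockaddr_in addr;
        int sock;

        MPI_Init(&argc, &argv);

        /* Connect to the Java Receiver acting as a server. */
        sock = socket(AF_INET, SOCK_STREAM, 0);
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_port = htons(PORT);
        addr.sin_addr.s_addr = inet_addr("127.0.0.1");
        connect(sock, (struct sockaddr *)&addr, sizeof addr);

        for (;;) {   /* forward the continuous stream */
            MPI_Recv(pkg, PKG_SIZE, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            write(sock, pkg, PKG_SIZE);   /* push to the Java Receiver */
        }
        /* never reached */
    }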

Shared Memory Connector

In order to implement a shared memory connector we wanted to make use of a shared memory segment implementation provided by System V (roughly a UNIX “flavor”).

In this particular connector implementation the framework's Receiver or Sender loads a dynamic shared library in order to communicate with a separately started LAM/MPI process. The shared memory segment itself is a circular bounded buffer, while (for instance) a Receiver operates as a consumer and the LAM process as a producer (instead of the server-client roles used in the socket connector).

Figure 4.2: Illustration of a separate Java process and two distinct LAM/MPI processes. The double-lined arrows denote a connector.

Although a rather straightforward solution, this connector implementation turned out not to be possible: the LAM/MPI processes were prevented from attaching a shared memory segment of the (rather large) required size after process initialization.

This restriction is due to the LAM process model. In particular, a process started by the lamd daemon inherits the process environment of that process (see 4.1 Development environment). Though this might be possible to circumvent, we experienced the same problem when loading the library into the Java process. Hence, we considered this solution non-portable, since it requires fine-tuning of both LAM and Java Virtual Machine properties closely tied to a specific operating system. (It is possible to specify the size of memory segments prior to process creation.)

Since we still wanted to explore a connector making use of JNI, we decided to use a System V message queue instead, which we implemented as the bounded buffer described above (i.e., a bounded buffer wrapping a message queue). Note that the kernel's queue size is restricted to 4096 bytes; thus the queue only provides space for at most two LOIS packages at any given time. This implies that a Sender or a Receiver has to block more often than would be required with a larger queue (because of synchronization). Also note that function calls performed through JNI have relatively poor performance. However, from the perspective of a processing component this time penalty is negligible, being hidden in the Sender or Receiver thread (a common technique for hiding latencies of different kinds).
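
For reference, a stripped-down sketch of the System V message queue primitives involved is shown below (our own illustration; the key, the package size and the message type are assumptions). The producer side runs in the LAM/MPI process, the consumer side in the library loaded by the Java Receiver:

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/msg.h>
    #include <string.h>

    #define QUEUE_KEY 0x4C4F4953   /* hypothetical key                */
    #define PKG_SIZE  2048         /* assumed LOIS package size       */

    struct lois_msg {
        long mtype;                /* required by System V, must be > 0 */
        char data[PKG_SIZE];
    };

    /* Producer: the LAM/MPI process enqueues one package. */
    int enqueue(const char *pkg)
    {
        struct lois_msg m;
        int qid = msgget(QUEUE_KEY, IPC_CREAT | 0600);

        m.mtype = 1;
        memcpy(m.data, pkg, PKG_SIZE);
        return msgsnd(qid, &m, PKG_SIZE, 0);    /* blocks if the queue is full */
    }

    /* Consumer: the native library loaded by the Java Receiver dequeues. */
    int dequeue(char *pkg)
    {
        struct lois_msg m;
        int qid = msgget(QUEUE_KEY, IPC_CREAT | 0600);

        if (msgrcv(qid, &m, PKG_SIZE, 1, 0) < 0)  /* blocks if the queue is empty */
            return -1;
        memcpy(pkg, m.data, PKG_SIZE);
        return 0;
    }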

Message queues are not the only alternative; there also exist shared memory approaches based on memory-mapped files, which require disk accesses.

mpiJava

As stated above, making use of mpiJava does not require a separate runtime start up of the framework and the LAM/MPI processes, since the framework itself is allowed to invoke MPI functions through the interface. Further, the runtime configuration is still static, since the underlying implementation is MPI-1.1 compatible.

As a consequence, we cannot use any of the MPI-2 features, since no wrappers are implemented for them. Surprisingly, it was not possible to make use of mpiJava in our development environment: for some reason the start-up script prunjava terminated all lamd processes without recreating them. Since we were not able to locate the problem within a reasonable time limit, we decided not to consider mpiJava further. Note, however, that the technique used by mpiJava is still applicable in our development environment and is not specific to mpiJava. In fact, the start-up mechanism of LAM is, according to the LAM manual pages, general and allows for distributed start up of any program. Hence, we let the application schema file launch scripts instead of C executables. The application schema file is a job description intended for mpirun, describing all processes to be launched by the mpirun command. These scripts in turn launch the framework's Processing Components. In particular, this solution made it possible to implement our own MPI wrappers.
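
As an illustration of this approach (node numbers, paths and script names are our own assumptions, and the exact schema syntax differs between LAM versions), the application schema file lists one entry per process to launch and is handed to mpirun:

    # appschema -- launch wrapper scripts instead of C executables
    n0 /home/lois/bin/start_receiver.sh
    n1 /home/lois/bin/start_sender.sh

    $ mpirun appschema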

MPI-Wrapper Connector

Before realizing LAM’s general start up mechanism we considered the Socket and Shared Memory connector (and variations thereof) as the only possible solutions.


Like the shared memory connector described above, this solution also makes use of the JNI interface, but it does not require a separation of the framework's Java process and the LAM/MPI processes, since the loaded libraries are permitted to call MPI functions. Because an MPI-1 compatible implementation (e.g., ours) does not allow more than one thread to call MPI functions, we had to restrict the wrapper: a single Java class had to implement both the Sender and the Receiver interfaces, and it was thus impossible to use separate libraries for the receiving and sending roles. This would not be a problem in an MPI-2 compatible implementation. When running this solution we encountered another problem: after approximately 10-15 minutes the MPI runtime system aborts due to a segment violation (i.e., a segmentation fault). Recall that a segment violation is a signal issued by the operating system when a program uses a bad memory reference, such as a NULL pointer or a pointer to an unallocated memory area. After making sure that the wrappers always check for bad references, we considered the integration of Java and MPI the most likely reason (after reconsidering the mpiJava documentation). Note that the MPI runtime system aborts the application and not the virtual machine itself. Usually the virtual machine itself throws an exception for faults like segmentation faults (though named differently); in fact, the JVM uses signal handlers to handle such faults. For instance, if a call stack overflows, the JVM takes corrective action inside this handler [14]. Thus the MPI runtime system handles signals intended for the JVM (or possibly the reverse).
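
For clarity, a fragment of the kind of wrapper involved is sketched below (class, method and parameter names are our own assumptions, and error handling is omitted). The native library is loaded by the Java class and calls MPI directly:

    #include <jni.h>
    #include <mpi.h>

    /* Hypothetical native implementation behind a Java method declared as:
     *   private native void mpiSend(byte[] pkg, int dest);               */
    JNIEXPORT void JNICALL
    Java_MpiWrapper_mpiSend(JNIEnv *env, jobject self, jbyteArray pkg, jint dest)
    {
        jsize len = (*env)->GetArrayLength(env, pkg);
        jbyte *buf = (*env)->GetByteArrayElements(env, pkg, NULL);

        /* Only one thread may be inside MPI in an MPI-1 implementation. */
        MPI_Send(buf, len, MPI_BYTE, dest, 0, MPI_COMM_WORLD);

        (*env)->ReleaseByteArrayElements(env, pkg, buf, JNI_ABORT);
    }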

Recall from Section 3.4.4 Java MPI that the mpiJava developers state that they have solved problems like these. In fact, it is a problem known to Sun (see [25]) and could be resolved through a process known as signal chaining. Briefly, signal chaining permits native code (e.g., the MPI library) to install its own signal handlers, while the virtual machine delegates (chains) a signal if it is not intended for the JVM itself.
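
On Sun's virtual machines, signal chaining is typically enabled by preloading the libjsig library before the JVM starts; a sketch of how a Processing Component could be launched (the library path and class name are assumptions depending on the installation):

    $ LD_PRELOAD=/usr/java/jre/lib/sparc/libjsig.so java ProcessingComponent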

Since the MPI standard (according to the MPICH manual pages) requires that all signals used by an implementation be documented, this connector may be a feasible approach worth exploring for a specific MPI implementation. Since this thesis does not assume a specific implementation, we did not carry this process further. First, not all MPI implementations use the same signals (which in turn are tied to a specific operating system). Second, it would require a thorough understanding of the signal handlers used by the JVM itself (which differ from platform to platform). Third, the synchronization and collective features offered by the specification crashed the application immediately. Fourth, there is great uncertainty as to whether the MPI signal handlers will automatically chain the signal as well, or whether this has to be implemented explicitly.
