THIN, DISCONNECTED CLIENTS ON A HOSPITAL IT NETWORK


By Robert P. Bialek & Peter Brøndum

Project No:

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE AT UNIVERSITY OF COPENHAGEN, COPENHAGEN, DENMARK, SEPTEMBER 2001

© Copyright by Robert Bialek & Peter Brøndum, 2001

Abstract

An application that supports a work-flow in a hospital puts special requirements on the system:

1. The application must be reliable. We cannot risk any loss of data.
2. The application should preferably execute on a hand held computer. The hospital workforce is mobile.
3. The application should be available and function continuously even though the hand held computer is periodically disconnected from the network. There may be some areas without network coverage.
4. The application must be able to exchange data with servers on the fixed network automatically and transparently to users. There may be no time to go and synchronize the data.

Creating a system that fulfills all these requirements is the challenge in this thesis. These requirements are shared by many environments other than hospitals. Consequently, we see a use for a general system that supports the creation of reliable applications for periodically disconnected hand held computers.

In this thesis, we have clarified the concepts of how to build a client-server support system that combines support for disconnections and support for reliability in one hand held device. We have performed a bottom-up analysis of the issues that arise when building such a system. We have also designed and implemented a prototype of a general client-server support system. The client-server support system guarantees delivery of requests and responses. The system is able to sustain crashes and recover its state. The system is also able to handle varying degrees of connectivity, ranging from connected through intermittently connected to disconnected computers.

We have demonstrated the general client-server system by building a hospital work-flow application using it. The application is a browser, and the work-flow tasks are expressed in forms generated by a server. The server cooperates with a hospital IT system created by Radiometer A/S.

Acknowledgements

This work was done as a part of the project apparater.dk. We would like to thank our advisor, Professor Eric Jul from the Department of Computer Science at the University of Copenhagen, for his good advice, general support and motivating attitude. We would also like to thank Radiometer Medical A/S for an interesting project and general support. We would especially like to thank our contact person at Radiometer Medical A/S, Tommy Andreasen, for good support around Rime. Finally, we thank Symbol Technologies Inc. for test equipment and good service.

UNIVERSITY OF COPENHAGEN

Date: September 3, 2001
Authors: Robert Pawel Bialek & Peter Brøndum
Title: Thin, Disconnected Clients on a Hospital IT Network
Project No:
Degree: Master of Science
Supervisor: Professor Eric Jul
Department: Computer Science Department (DIKU)
Convocation: September 2001

Permission is herewith granted to University of Copenhagen to circulate and to have copied for non-commercial purposes, at its discretion, the above title upon the request of individuals or institutions.

Signature of the author
Signature of the author
Signature of the supervisor

Contents

1 Introduction: Problem formulation; Goals; Our approach; System overview; Demarcation; Structure of the report

2 Background for the thesis: The hospital application; Work process using a central repository; Work flow using distribution; Work flow modifications; Implications of disconnections; Existing IT system; General requirements for hospital systems; Technology restrictions; Processing and battery limitations of hand held computers; Limitations of hand held computers due to network connectivity; Requirements to the system support

3 Structure of the analysis

4 Client server models: Basic concepts; Simple request/reply scheme; Analysis of client server layers; Communication layer; Application support layer; Choosing client server model structure and interface; Object oriented client server model; Internet type model; Choosing client server model structure and interface - conclusion; Summary - Client server models

5 Client server system for hand held computers: Processing and memory reduction; Device-specific adaptation; Placement of adaption logic; General technics for adaption; Identifying client capabilities; Transformation of server response; Summary - Device-specific adaptation; Network limitations; Network adaptation problems; General design; Adaptation methods; Summary - Network adaptation; Summary - Client server system for hand held computers

6 Client server model with disconnections: Important issues concerning disconnections; Analysis of overall approach; Changing low level protocol; Asynchronous client server system; Client-agent-server system; General issues when using local server data and logic; Mobile object model; Client-proxy-server; Our approach; Summary - Overall approach; Analyzing the proxy; Support for complex servers; Support for simple object servers; Requirements for downloaded objects; Transformation between request/response and messages; Encoding of request; Cache coherency; Management of cache; Handling outdated requests; Finding the agent; Asynchronous interface to application; One-way requests; MOM primitives used; Summary - Proxy; 6.4 Analyzing the agent; Analysis of a simple agent; Analysis of a complex agent; Analyzing the server; Support for mobile objects; Advantages of the client-agent-server model; Analysis of MOM communication layer; Methods for exchanging messages between MOM-queues; Handling network disconnection; Analysis of browser application; Awareness of disconnection; Browser sessions; Application level cache; Control of parameters in system; Summary - Client server model with disconnections

7 Reliable client server systems: Fault model for the client server system; Level of reliability; Containment strategy; Strategy for guaranteed delivery of request/responses; Strategy for guaranteed delivery of request/responses; Summary - Fault model for the client-server system; Handling errors in modules; Saving state necessary for recovery; Location of persistent state; Summary - Handling errors in modules; Handling errors between modules; Transactional protocol; Summary - Handling errors between modules; Recovering from system faults; Maintaining module functionality; Restarting modules; Summary - Recovering from errors; Summary - Reliable client server systems

8 Application supporting a work-flow process in a hospital: Overall approach; Analysis of a work-flow process; Analysis of task execution in an IT system; Work flow management systems; Integration with Radiometers existing IT-system; Conclusion on application

9 Analysis Conclusion: Choice of client-server model; Implications when using hand held computers; Implication when client is periodically disconnected; Implications of reliability on a client-server system; A hospital application using a disconnected system

10 Design: General overview of the design; Communication layer design; Flexibility of Communication layer; Addressing messages; Automation of message transportation; Reliability of communication layer; Summary - Communication layer design; Application support layer design; Connection to communication layer; Separating application interface from the request reply manager; Servicing requests during disconnection using cache; Application agent; Summary - Application support layer design; Application layer design; Interacting with user; Managing underlying layers; Controlling application logic; Summary - Application layer design

11 Implementation: Choice of programming model/language; Unimplemented functions; System presentation; Hand held side; Communication layer on the fixed network side; Application support on the fixed network side; API; Application on the fixed network side; Source code from others; System start up; Starting fixed network system; Starting palm application

12 Test: General test setup; Hand held computer; Wireless AP; Stationary computer; Network; Software; Performance Test; General time distribution; Request Time; Time in local MOM; Time of transport in MOM; Time of parsing and displaying; Servicing from cache; Disconnections; Transporting message to MOM; Transporting message from MOM; Disconnecting during transportation; Reliability test; Crash before synchronizing; Crash after synchronizing; Crash during the synchronization

13 Related Work: Object oriented systems; Rover; CORBA; Data-servers; Coda; Bayou; Oracle Mobile Agents and Oracle Lite; Thin client systems; WebExpress; Gate-way solutions; W; Citrix/VNC/PCanywhere

14 Conclusion: Main problems; Project goals; The system; Proxy Agent system; Disconnected thin client application; Portability; Performance; Future of the system

A Appendix: A.1 XML; A.2 Rime interface; A.3 URL addressing; A.4 Design of the internet type model in Java; A.5 Components in asynchronous systems; A.6 MOMS; A.6.1 MOM systems in general; A.6.2 Types of MOM systems; A.6.3 The Java Message Service (JMS); A.6.4 JMS in detail

B Reliability methods: B.1 Computer system faults in general; B.2 General fault tolerance technics; B.2.1 Containment; B.2.2 Masking; B.2.3 Recovery

List of Figures

1.1 Overview of the system
The network environment for the off-line system
Structure of the report
Client-server applications cooperating
Model of a simple client server system
Client server models
Adaptation methods General model
Intercepting proxy-agent design
A simple background server
Message-oriented-middleware system
Asynchronous client server system based on Message Oriented Middleware
Client-agent-server design
Local objects model with lazy synchronization with server objects
Client-proxy-server model
Our client-proxy-agent-server model
Client-proxy-agent-server model with thin client applications
Reliable client server model
Process pair of watchdogs
Serialized activities
Merging activities
Dependencies between activities
Dependencies between activities
Transporting activities in the system
Design of the system on client and server side
Design of the communication layer using MOM, with the JMS interface
10.3 Design of the communication layer, that automatically transports messages between queues
Application support layer
Dividing application support
Design of the server
Design of application Logic (server side)
Application on the client side
Object and thread relationships
Communication layer's objects and threads on the server side
Server objects and threads
The implemented system
The round trip of a request
Request times for different requests
Request times with variable polling time
Request times with variable synchronization time
The relation between response message size and time of fetching the message
The service time of the request on the fixed network side
The relation between message size and the parsing time
Request's service time from the cache
A.1 Message Oriented Middleware

Chapter 1 Introduction

In this thesis, we present a system that supports building reliable applications for periodically disconnected hand held computers. The system is a middleware system that can be used for building client server applications. The contribution of the system is that it combines support for disconnections and support for reliability in one general client-server support system dedicated to small hand held computers. To our knowledge, this approach is new.

The system has been built systematically bottom-up in order to take into account the general limitations of hand held computers, support for disconnections and reliability issues. The system has been designed in a layered structure to ensure an open and flexible design. The system is able to handle varying degrees of connectivity, ranging from connected through intermittently connected to disconnected computers. The system is reliable: even after a program crash no data are lost, and delivery of requests and responses is guaranteed. Communication-specific issues have been abstracted into a message-oriented-middleware layer.

In addition to a general client-server support system, the contribution of this thesis is a hospital work-flow application that demonstrates the general system. The application is a thin client application expressed in XML forms. The forms are generated by a server and executed in a browser that can be disconnected. The application works with an existing hospital IT-system designed by Radiometer Medical A/S.

We have used the requirements for the work flow application to deduce requirements for a general client-server support system. Because the application was designed for practical use in a hospital environment, there is a special emphasis on reliability. We believe that support for reliable applications for hand held computers with periodic disconnections is generally useful and even needed in practical systems.

We have implemented most of the system ourselves. The system has been implemented in Java and tested on the Palm platform. The thesis includes analysis, design and implementation of the system as well as a short analysis of the hospital application using the general system. We include test results showing the main functionality of the system. The thesis also treats the theory that is necessary to understand the system.

1.1 Problem formulation

This thesis was done as a part of the project apparater.dk. Radiometer Medical A/S is a participant in apparater.dk. Radiometer Medical A/S manufactures and sells measurement equipment to hospitals. The company saw a use for hand held computers to support work processes in a hospital in connection with their apparatus.

For example, a work process could be as follows. The hand held computers could be used by doctors to request a measurement. A nurse could use a hand held computer to fetch the requisition. When taking the blood sample for the measurement, the nurse could register important information such as the ID of the blood sample and the patient's temperature. The sample could then be transported to an apparatus (for example by a pneumatic transport system) and measured. The measured data could then be sent from the apparatus to a central server and matched with the other data. Finally, the doctor could receive the measurement results on his hand held computer.

The hand held computers in this system should be able to function while disconnected. Many hand held computers are only equipped with a serial cable or an infrared (IR) network connection, so a continuous connection is very difficult to maintain. Hand held computers with wireless LAN might lose the connection if the computer is carried outside the coverage area. Additionally, temporary network or server application faults might result in disconnections. Support for disconnections can thus ensure availability of the system in the presence of faults.

Radiometer's interest in this project was:

1. To get a documented prototype of a system that could support the above mentioned functionality.
2. To get inspiration on a theoretical level on how to make systems involving hand held computers and, in particular, how to support disconnections.
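The matching step in the example work process above can be illustrated with a small sketch. It is not taken from the thesis prototype: the class `SampleRepository`, its methods and the string formats are hypothetical, and an in-memory map stands in for the central server. The sketch only assumes what the text states, namely that the nurse's registration and the apparatus measurement arrive independently and are linked by the blood sample ID.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical central server that matches data arriving from disconnected sources.
// The blood sample ID is the key that links the nurse's registration to the measurement.
class SampleRepository {
    private final Map<String, String> patientBySample = new HashMap<>();
    private final Map<String, Double> temperatureBySample = new HashMap<>();
    private final Map<String, String> measurementBySample = new HashMap<>();

    // Called when the nurse's hand held computer delivers its registration.
    void registerSample(String sampleId, String patientId, double temperature) {
        patientBySample.put(sampleId, patientId);
        temperatureBySample.put(sampleId, temperature);
    }

    // Called when the apparatus reports its measurement.
    void registerMeasurement(String sampleId, String measurement) {
        measurementBySample.put(sampleId, measurement);
    }

    // A result exists only when both parts have arrived; the arrival order does not matter.
    Optional<String> resultFor(String sampleId) {
        if (!patientBySample.containsKey(sampleId) || !measurementBySample.containsKey(sampleId)) {
            return Optional.empty();
        }
        return Optional.of(patientBySample.get(sampleId) + ": " + measurementBySample.get(sampleId)
                + " (temperature " + temperatureBySample.get(sampleId) + ")");
    }
}
```

Because either part may arrive first (the nurse's hand held computer may be disconnected while the apparatus is already reporting), the lookup returns an empty result until both are present.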

We found the problem to be very interesting from both a theoretical and a practical point of view. What was needed was a system that supported disconnected work and was designed specially for hand held computers. The hospital environment demanded special attention to reliability and to the varying capabilities of users. The application logic demanded that data created by a number of disconnected sources should be matched. The system could be built as a client server system.

Our interest in the project (and DIKU's) has been to solve an interesting practical problem in a general way. To our knowledge it is a new approach to combine support for disconnections and support for reliability into one general client-server support system dedicated to small hand held computers.

Thus, combining the interests of Radiometer and ourselves, the problem in this thesis has been to analyze and design a reliable client-server system that supports disconnected work and executes on hand held computers with limited resources. The description of this analysis can be used as a source of inspiration on a theoretical level for Radiometer. In addition, using the general client server support system, we will build a prototype of an application that supports the hospital work process that Radiometer needs. The prototype will be integrated in Radiometer's existing hospital IT-system. We use this application to demonstrate the general system.

The problem can be formulated in four subproblems:

1. How can we handle the general limitations of hand held computers in the client server support system?
2. How can we handle disconnections in the client server support system when hand held computers are involved?
3. How can we support reliability in the client server system when hand held computers are involved?
4. How can we design the work flow application that uses the system for disconnected work?

1.2 Goals

In this section we present the goals of this thesis. We focus on both the theoretical and the practical side of the project. Our goals with the project are:

1. To make a literature search for each of the four subproblems mentioned in the problem formulation.

2. To analyze the four problems and find solutions for each of them, so they can be combined into a system dedicated to hand held computers.
3. To design an open system that can be easily expanded, where system modules can be changed, and where using the system is not difficult.
4. To implement and document a prototype that demonstrates the system. This application is a proof of principle showing that our client server system is realistic. The application supports a work flow process in a hospital. The prototype should work with Radiometer's existing hospital system.
5. To test the major parts of the system. The test should show the functionality of the system and its ability to sustain disconnections, crashes and limited capabilities of the devices.

1.3 Our approach

We start by analyzing the general requirements for our client server system. We do this by looking at the hospital work-flow process, the general requirements for hospital applications and basic technology restrictions. We use these requirements to deduce realistic requirements for a system support layer. Then, we make a short analysis of the client server model. This analysis introduces the main concepts of the thesis.

In order to find a solution to our problem and reach our goals, we search for theories about the first three subproblems mentioned in the problem formulation. For each subproblem we start by creating simple models of our system. By analyzing the problems we then successively improve and refine our model of the system. After every analysis, we draw a conclusion. The conclusion is used to sharpen the later analyses.

After having completed the analysis of the general client server system, we analyze a realistic application using it. We finally present the complete system design, our implementation of the system and a test of the implementation.

1.4 System overview

In this section we present an overview of our system. We believe this early introduction to the system will help the reader to understand the analysis in the thesis.

Figure 1.1: Overview of the system. We have divided the system into three vertical layers: communication layer, application support layer and application layer. The system is also divided into two horizontal layers: the fixed network side and the hand held computer side.

As shown in figure 1.1, the model describes two computers cooperating on a task: a hand held client and a server. The server is placed on a fixed network. The hand held computer connects to the fixed network through an access point (AP) that can be either a PC with a serial cable, an IR AP, or a wireless LAN AP. The hand held computer can be moved from AP to AP.

The system has four layers that have different responsibilities:

1. Communication layer. The communication layer consists of the protocol stack used to communicate between the two computers. The responsibility of this layer is to move bytes between the computers. The communication layer offers asynchronous communication primitives. This implies that using the communication layer does not depend on the network connectivity, since control is returned immediately to the caller. In this way the communication layer abstracts all communication issues away from the layers using it.

The communication layer is built as a message oriented middleware (MOM) that guarantees message delivery. All communication is transported in messages. Messages are kept in a persistent queue until they can be transported to the target queue. When messages are received, they are kept in the queue until they are fetched from it by the target process.

2. Application support layer. The responsibility of the application support layer is to offer a client server interface to the application layer. The application support layer offers asynchronous invocations and supports disconnected work. We use an internet type interface (for example the HTTP protocol using GET, PUT and POST requests). The layer offers delivery guarantee for requests and responses. The application support layer uses the communication layer to transport data.

The application support layer translates requests into messages and reconstructs replies from messages sent by the server. On the hand held side this is done by a proxy. During disconnections, the proxy services requests from the cache and queues requests that cannot be serviced. The proxy offers a simple write back functionality. The application logic takes care of writes, i.e. it generates an update log. The proxy only updates the cache and sends the update log to the server. In an object model, the update logic can be included in the objects themselves.

On the fixed network side, the application layer uses an agent that executes the requests against the server, which usually communicates using synchronous primitives. The agent translates server replies into messages. The agent's main responsibility in our model is to transform between an asynchronous model and a synchronous model. However, the agent can be extended to execute more complex tasks on behalf of the hand held computer.

3. Thin client support layer. This layer is actually a part of the application layer that we will describe later. However, because of its importance we will describe this sub-layer first.
This layer consists of a browser and a form generator. The layer supports thin client applications, where the browser is responsible for displaying form-objects downloaded with the client server system. The browser is also responsible for issuing requests defined by links in the browser objects. In addition, the browser is responsible for interacting with the user.

The browser uses the application support layer. Downloaded objects like form-applications are cached, and requests issued while the hand held computer is disconnected can be queued. On a special page the browser can show pending requests and give access to responses when they return.

The form generator is responsible for creating forms that can be displayed by the browser. The forms are generated specifically for the hand held computer platform that is used.

4. Application layer. The application layer is the layer containing the final application logic. The application layer can use any of the three mentioned layers directly. Depending on the layer used, the programming paradigm changes:

(a) Thin client support layer. When using the thin client support layer, the client side of the application is expressed in forms. Most of the application logic is executed on the server. The thin client support layer is itself an application. This layer supports the thin client programming paradigm [1].

(b) Application support layer. When applications use the application support layer directly (not shown in the figure), both the client and the server access the application support layer and perform asynchronous request-reply communication. This layer offers a general client server programming paradigm with an internet type interface and support for disconnections.

(c) Communication layer. When applications use the MOM communication layer directly (not shown in the figure), both the client and the server access the communication layer and communicate through messages. Using the communication layer directly offers a very flexible programming model, namely the message passing paradigm.

We built the specific prototype application for Radiometer using the thin client support layer. The application consists of a number of form-applications and a server that coordinates the work-flow and communicates with Radiometer's existing hospital IT-system.

The model is simplified. We have only shown one server and one client.
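The interplay between the layers described in this section can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the thesis implementation: the class names `Proxy` and `Agent`, their methods, and the use of a `Function` as the synchronous server are all hypothetical, and a plain in-memory queue stands in for the persistent MOM queue, so the sketch does not survive a crash. It shows only the behaviour named in the text: requests return immediately, cached responses are served while disconnected, other requests are queued, and an agent executes the queued requests against a synchronous server once a connection is available.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;

// Hypothetical agent on the fixed network side: it turns a queued request
// into an ordinary synchronous call against the server.
class Agent {
    private final Function<String, String> server;

    Agent(Function<String, String> server) {
        this.server = server;
    }

    String execute(String url) {
        return server.apply(url);
    }
}

// Hypothetical proxy on the hand held side.
class Proxy {
    private final Map<String, String> cache = new HashMap<>();   // responses keyed by URL
    private final Queue<String> pending = new ArrayDeque<>();    // stand-in for the persistent MOM queue

    // Asynchronous request: returns immediately, connected or not.
    // Yields the cached response, or null if the request had to be queued.
    String request(String url) {
        String cached = cache.get(url);
        if (cached == null) {
            pending.add(url);
        }
        return cached;
    }

    // Called when an access point is reachable. A request is removed from the
    // queue only after its response has been stored, so a failure in
    // agent.execute leaves the request queued for the next attempt.
    int synchronize(Agent agent) {
        int delivered = 0;
        while (!pending.isEmpty()) {
            String url = pending.peek();
            cache.put(url, agent.execute(url));
            pending.remove();
            delivered++;
        }
        return delivered;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

A browser built on such a proxy could list `pendingCount()` requests on its pending-requests page and show the responses after the next call to `synchronize`.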
Notice that some of the functionality placed in application support on the fixed network might be abstracted into autonomous processes that run on other computers. In addition, we have not shown in detail how reliability is supported, and we have generally not indicated any implementation details [2]. In the analysis we will develop the system through successive improvement and refinement. The final system presented in the design section (see section 10) will be more complex than the system described here. However, the system described should give a useful general picture.

[1] By thin client programming paradigm, we mean browser-type programming.
[2] The modules could in principle be implemented as either libraries, threads or processes.

1.5 Demarcation

In this section we present the boundaries of our project and describe relevant areas that we will not analyze.

Figure 1.2: The network environment for the off-line system

In figure 1.2 we present a model of the environment that our system should function in. The model is structured as a number of mobile nodes and a fixed network with a number of stationary nodes. The mobile nodes are hand held computers that access the fixed network through access points (AP). The AP can be a wireless AP [3] or a serial AP [4]. They can access servers on both the local network and the internet.

[3] Wireless LAN AP, infrared AP.
[4] Cradle with direct or indirect access to the fixed network.

We limit our project to an analysis of disconnection, the impact of processor-weak computers, and the impact of reliability requirements. We do not focus on roaming or routing problems connected to disconnections and high mobility of hand held computers. We will not focus on specific protocols used when communicating over different physical media. We are interested in a general system.

In general, we do not make an analysis of disconnected workflow support systems using hand held computers. We are not making a CSCW analysis of hospital work processes in general. However, we will analyze how the specific hospital workflow application can synchronize data when the data is generated by many disconnected sources.

We will not discuss security aspects.

1.6 Structure of the report

Figure 1.3: Structure of the report

As shown in figure 1.3, the report is structured in 10 chapters. The arrows indicate the preferred order of reading the chapters in order to understand the later chapters. The chapters are:

1. Introduction. The introduction is this chapter. This chapter is generally important, but we especially recommend that some time is spent reading section 1.4. In the system overview, we briefly present the system that we analyze. Knowing the system will help in understanding the analysis.

2. Background for the thesis. In this chapter, we present the two important influences on our project: the technology and the hospital requirements. The chapter presents present-day hand held devices and their limitations. In this chapter, we also present a work-flow process in a hospital.

3. Structure of the analysis. In this chapter, we present the overall structure of the analysis. We briefly argue why we perform the analyses and what results we get from them.

4. Client-server models. This is a general analysis of client server models that results in choosing the internet type client-server model as the interface for our system. This is a general chapter that presents the background knowledge. The most important part of the chapter is section 4.4, Choosing client server model structure and interface.

5. Client-server system for hand held computers. In this chapter, we discuss how we can handle the general limitations of hand held computers due to processor and memory, device screen and size, and network connectivity. The chapter results in choosing a browser type application that uses a client-proxy-agent-server design. This chapter is important for understanding the choices we make in the later analyses. This analysis assumes the internet type client-server model from the previous chapter.

6. Client-server model with disconnections. This is the most important chapter in the analysis. In this chapter, we analyze the different solutions for handling disconnections. We propose a system built on asynchronous communication primitives. The system uses the client-proxy-agent-server design. In this chapter, we perform analyses of a proxy, an agent, and a server in relation to disconnections. We also address the issue of awareness of disconnections.

7. Reliable client-server systems. This is the second most important chapter in the project. In this chapter, we analyze the method that will ensure delivery of requests and responses. We analyze methods that ensure reliability and can be used in a system with disconnections. In this chapter, we introduce how we can assure reliability on different levels in the system. We will show that we can guarantee delivery of requests and replies even after a system crash. In section 7.2, Handling errors in modules, we present how we can keep the system state and recover after a crash. In section 7.3, Handling errors between modules, we present methods for reliable delivery of data between the modules.

8. Application supporting a work-flow process in a hospital. In this chapter, we discuss problems related to the application supporting a work-flow in a hospital. The application addresses a general problem, namely the matching of data produced by multiple disconnected sources. We analyze a method for supporting parallel and disconnected work. We introduce methods for handling the synchronization problem that arises because of disconnections. Additionally, in this chapter, we analyze how our system can be combined with Radiometer's existing IT system.

9. Analysis conclusion. This is an important chapter summarizing the previous analyses. In this chapter, we present the whole system as a combination of the solutions to the problems from the sub-analyses. On the basis of this chapter we build the system design.

10. Design. This is a practically important chapter that presents how the analyzed theories can be combined into a whole system.

11. Implementation. For readers interested in working on the prototype, we present the most important implementation choices and solutions. We also present the overall implementation and object structure of the system.

12. Test. This chapter includes both a description and the results of the tests.

13. Related work. This chapter is important for understanding the difference between our system and other related systems.

14. Conclusion. This is the main conclusion of the project.

Chapter 2 Background for the thesis

In this chapter, we describe the motivation for our project in detail. It is a requirement analysis on a general level. The general system that we develop in this thesis has certain characteristics: it is made especially for hand held computers, it is designed to be reliable, and it is designed to support disconnections.¹ The hospital application that demonstrates the use of the general system shows how applications distributed on multiple periodically disconnected hand held computers can cooperate. In the following, we analyze why we need the described characteristics.

We construct the system bottom-up, but the requirements were deduced top-down. The requirements for the hospital application determine the requirements to the underlying system support, but the general system is useful beyond the scope of the particular application we discuss here.

We start by describing the requirements for the application by looking at a hospital work process that we want to support. We look at Radiometer's existing IT systems. Then, we look at general requirements in hospital environments. Finally, we have some requirements that are the result of technology restrictions. We show that support for periodic disconnections is necessary, and that the system should be reliable. On the basis of the requirements to the total system, we deduce the requirements to the system support.

¹ By hand held computer, we mean a small device that can be held in a hand. It often has a limited screen, processor, and memory. It is equipped with a network connection (a serial port, infrared wireless port, Ethernet port, USB, or wireless LAN).

2.1 The hospital application

Radiometer Medical A/S is part of the project apparater.dk. Radiometer Medical A/S manufactures and sells measurement equipment to hospitals. The company saw a use for hand held computers to support a work process in a hospital in connection with their apparatuses. In the following sections, we describe:

1. The work process that we want to support in our application.
2. The existing hospital IT system.
3. General requirements in hospital environments.
4. The requirements for the application and the system support.

Work process using a central repository

The apparatuses that Radiometer sells are used for measuring a number of special parameters in blood (blood gases, metabolites and electrolytes). These apparatuses are part of a work process in a hospital. Nowadays, the process is performed manually using paper notes and heavy work distributing mechanisms.²

The work process generally starts with a doctor making a requisition for the measurement of some or all of the parameters that the apparatus can measure. The requisition could be made on a hand held computer and communicated to a central repository.³ The person who carries out the requisition is called an operator. The operator is typically a nurse or a laboratory technician. The operator receives the requisition from the central repository using a hand held computer. A blood sample is drawn from the patient. Before that, the operator has verified that the patient's ID on the requisition matches the real patient's ID (for example by bar-code scanning). A unique ID for the blood container (often a syringe) is registered (for example by bar-code scan) along with notes about the patient that may affect the interpretation of the results.⁴ Then, the requisition is linked to a particular blood sample ID, and this information is communicated to the central repository.

The blood sample is transported to the measurement apparatus (sometimes in a pneumatic tube system). Generally, the apparatuses will be connected to the fixed network, but we can also see periodically disconnected apparatuses.⁵ The ID of the blood sample is registered by the apparatus (for example by bar-code scan). The apparatus now does one of two things: if it is connected, it can contact the central repository, fetch the requisition, and measure only the required parameters on the blood sample. Alternatively, it can just measure all parameters on the blood sample. The measurement data is linked to the blood sample ID and sent to the central repository. The requisition, the blood sample data registered at blood sampling, and the measurements are matched in the central repository. A doctor can now request, view, and accept the measurement data on a hand held computer.

It is clear that the work process described here closely resembles the work process involving a doctor wanting any measurement on a patient. The use of hand held computers is not restricted to Radiometer apparatuses but applies generally in a hospital.

² All information has to be logged on paper and given to the hospital personnel. The documentation is needed for patient history, insurance, and accounting.
³ Alternatively, the doctor could communicate the requisition directly to a hand held computer belonging to the person who is to carry out the requisition. We look at this solution in the next section.
⁴ The notes can be: Arterial or venous blood? Is the patient given oxygen, and in what percentage? What is the patient's temperature? Is the patient fasting? How will the blood sample be transported (temperature)?
⁵ Either because of server/network breakdown or because the apparatus is a hand held device.

Work flow using distribution

An alternative to the above mentioned work process could be to communicate directly instead of using a central repository. The doctor communicates the requisition directly to the operator's hand held computer (for example by infrared beaming). The operator adds notes concerning the condition of the patient and transfers this information to the blood container (for example using a chip on the container or by encoding the information on a label). The blood container is transported to the apparatus. The apparatus reads the information and measures the requested parameters. The total information is sent to the doctor who requested the measurements in the first place.

Because it is not practical to transfer all the requisition information to the blood container, this solution is not realistic. In addition, requisition and measurement information in general has to be communicated to the central hospital information system (HIS) for accounting and history.

The only step where direct communication might be used is between doctor and operator. In conclusion, the work process should involve a central repository.

Work flow modifications

The work process described earlier is an idealized work process. In practice, this work flow cannot always be followed. Often a blood sample is drawn and measured first, and the paper work is done later. The doctor might want to view the measurement data as quickly as possible and fill out the requisition form later. In a realistic application that supports this hospital work process, it should be possible to start the activities in a different order. It is evident that we cannot measure a blood sample before it is drawn from the patient. In addition, the blood sample is only drawn because a doctor has ordered it. However, these are physical actions that are distinct from the documentation activities in the application. In the ideal case, these documentation activities are performed at the same time as the physical actions.

Implications of disconnections

For reasons that we will describe later, periodic disconnections of hand held computers cannot be totally avoided. This implies that we cannot guarantee that information is communicated instantly from the hand held computers to a central repository. Disconnection might result in transmission delays. This means that we cannot assume that the information that links the requisition and the blood sample has been communicated to the central repository at the time of measurement. This in turn means that either the link information must be on the blood container (for example a bar-code label with the requisition number placed on the blood container) or the apparatus must just measure all the parameters it can on the blood sample.
We have already rejected encoding the information on the blood container, so we assume that the apparatus just measures all the parameters it can if the requisition information is not yet available.⁶ (Notice that we assume that there is enough information on the blood container to make it clear that the blood sample should be measured on a Radiometer apparatus and not on another type of apparatus.) This means that some module must filter the measurement data and show only the view of the data that corresponds to the requisition.

⁶ This is actually not an unreasonable assumption, because the apparatus often measures all parameters anyway. However, in order to save blood volume, the apparatus can use special modes that restrict the number of parameters.
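
The consequence of these assumptions can be sketched in a few lines of Python. The sketch is only an illustration under stated assumptions, not the design developed in this thesis: the class name `Repository`, its methods, the IDs, and the parameter names are all hypothetical. It shows a repository that accepts the requisition, the link between requisition and blood sample, and the full measurement in any order, and that exposes only the requested parameters once all three pieces are present.

```python
class Repository:
    """Toy central repository (hypothetical); records may arrive in any order."""

    def __init__(self):
        self.requisitions = {}   # requisition ID -> set of requested parameters
        self.links = {}          # requisition ID -> blood sample ID
        self.measurements = {}   # blood sample ID -> {parameter: value}

    def add_requisition(self, req_id, parameters):
        self.requisitions[req_id] = set(parameters)

    def add_link(self, req_id, sample_id):
        self.links[req_id] = sample_id

    def add_measurement(self, sample_id, values):
        self.measurements[sample_id] = dict(values)

    def view(self, req_id):
        """Return only the requested values, or None while data is missing."""
        requested = self.requisitions.get(req_id)
        sample_id = self.links.get(req_id)
        values = self.measurements.get(sample_id)
        if requested is None or sample_id is None or values is None:
            return None
        # Filter: the apparatus measured everything, the doctor sees the order.
        return {name: value for name, value in values.items() if name in requested}


repo = Repository()
# The apparatus measured all parameters; its data arrives first.
repo.add_measurement("S-17", {"pH": 7.41, "pO2": 12.3, "pCO2": 5.1, "K+": 4.0})
print(repo.view("R-1"))   # nothing to show yet: the requisition is unknown
# The delayed hand held computers reconnect and deliver the rest.
repo.add_link("R-1", "S-17")
repo.add_requisition("R-1", ["pH", "pO2"])
print(repo.view("R-1"))   # only the requested parameters
```

Keeping the three record types in separate tables keyed by their IDs is what makes the arrival order irrelevant in this sketch; the view is simply recomputed from whatever has arrived.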

Since the measurement data may arrive at the central repository before the requisition or before the data linking the requisition ID to the blood container ID, we need some kind of merge procedure. In the simple work flow that we have described so far, the data can be matched using unique IDs.⁷ In principle, there is no problem in allowing users to take and measure blood samples before filling out a requisition. We just need some unique knowledge (for example the blood sample ID) in order to match the data. In chapter 8, we analyze the application supporting the described work process. In particular, we look at how we can match the results from concurrent processes.

Existing IT system

In addition to apparatuses, Radiometer Medical A/S sells a data management system that collects the measurement data from the apparatuses. The system consists of a central data manager and some clients that offer different functionalities. In this context, we only need to discuss the central manager, called Rime (Radiometer Instrument Management Engine). Each apparatus can contact Rime and deposit measurement data (blood measurement data and calibration data). Rime also contains demographic data like basic patient information and personnel information. Requisitions⁸ can also be registered in Rime. Rime is implemented as a CORBA system and offers CORBA interfaces.⁹

A hospital usually has a central information system (HIS). Requisitions and measurements should be deposited in this system. It is planned that Rime should interface with this system, but at this point the functionality has not yet been fully implemented by Radiometer. In our project, we assume that any access to the HIS is handled by Rime. Consequently, we will not concern ourselves with the specific way Rime communicates with the HIS. We focus only on the logic necessary to support the mentioned work process.

⁷ Requisition ID, patient ID and blood container ID.
⁸ Also called test-orders.
⁹ We will not go into detail with the Rime interface, because we consider this an implementation detail. In addition, we do not want to reveal more details about Rime than strictly necessary. The most important issues concerning Rime are presented in appendix A.2.

General requirements for hospital systems

There are some general requirements that apply to hospital systems.

Faults might result in injury or death of a patient, hence a very high degree of reliability is generally required. Loss of data is generally unacceptable. Building a reliable application is therefore essential.

In addition, hospital systems should have a high degree of availability. The application that we build should be accessible at any time. The user should be able to use the application continuously even if the network breaks down or the hand held computer loses its connection for a period of time.

Hospital personnel cannot be expected to have technical knowledge about computers. The user interface of our application should be simple, and most problems must be handled transparently to the user. This includes aspects such as handling faults and updating software.

Generally, data used by the application must be under central administrative control.

2.2 Technology restrictions

In addition to the requirements for the application due to the work process in a hospital, the existing IT systems, and the hospital environment, we have some requirements that are the result of technology restrictions. We want to use hand held computers, but today's hand held computers have a number of limitations in comparison to stationary computers. These limitations have to be taken into account when designing systems for these computers. Solutions that are valid for workstations may not be useful on small hand held (or wearable) computers.

We discuss the limitations of hand held computers in two sections:

1. Processing and battery limitations of hand held computers.
2. Networking limitations of hand held computers.

Processing and battery limitations of hand held computers

Today's hand held computers are characterized by orders of magnitude lower processing capabilities¹⁰ and memory¹¹ compared to stationary computers. Some systems only support a single process for applications. The Palm OS does not at this point have the capability for multitasking (Palm's Developer Guide). Multitasking can be simulated by using a Java virtual machine (for example, KVM offers multithreading). Windows CE, on the other hand, is designed to run multiple programs and tasks simultaneously. Multitasking is beneficial for background synchronization.

Even though hand held computers have relatively limited processing resources, there is a rapid development in the field. Some of the devices are already very powerful. Compaq recently released a multimedia hand held computer, iPAQ, that can show video and animation. Siemens has expressed plans to integrate a mobile phone into the hand held computer. Nokia has announced that it will extend its mobile phones with a Java VM, which will turn them into hand held computers (equipped with a wireless network connection). When designing a system for hand held computers, we should therefore try to take future development into account. Still, hand held devices are weaker than personal computers.

Furthermore, today's hand held computers have serious energy restrictions because of limited battery lifetime. The energy restrictions make it necessary to consider carefully how energy is used and to minimize this use. Therefore, we still tend to push as much of the processing as possible to the servers (on the fixed network).

Limitations of hand held computers due to network connectivity

The networking resources on hand held computers are often very limited. Stationary computers are normally continuously connected to the network through a reliable, high-bandwidth, fixed network connection.¹² To be useful, hand held computers should be usable while walking around. The mobility of hand held devices reduces either the bandwidth or the availability of the network. In the following sections, we discuss the network connections available to hand held computers now and in the foreseeable future. As we shall show, these network connections generally have lower bandwidth and are less reliable, and in effect they place higher demands on the hand held computer's ability to operate disconnected.

¹⁰ Palm V: 15 MHz; iPAQ (CE): 256 MHz; PC: 1 GHz.
¹¹ Palm V: 2 MB; CE machines: typically 32 MB; PC: typically 256 MB RAM + 30 GB hard disk.
¹² Today the standard is 100 Mb/s Ethernet.

When discussing the network connections available to hand held computers, we group the connections according to the coverage of the access points.

Low coverage

Serial cable,¹³ infrared connections,¹⁴ and Bluetooth¹⁵ are characterized by low coverage. The connection between the hand held computer and the network goes through an access point (AP) that is reachable only within a short distance. Given that the user should be able to walk around and still continue working, it can be concluded that the computer should be able to operate disconnected for long periods of time (at least while walking from one access point to another). If the work involves a server, the changes made while disconnected have to be synchronized with the server when an access point is available again. Compared to typical fixed network connections like Ethernet, these connections have much lower bandwidth.

¹³ Typically 115 kb/s.
¹⁴ For example IrDA (data transmission from 9600 b/s to 115 kb/s in asynchronous serial IR, up to a maximum speed of 4 Mb/s (synchronous 4PPM)).
¹⁵ 1 Mb/s.

High coverage

Wireless LAN¹⁶ and mobile phone¹⁷ access points have high coverage. This means that an access point is typically available at any location where a user might stand. Most systems permit roaming between access points. As a consequence, the user can walk around while still being connected to the network. In theory, the user of these systems can connect on demand and stay connected for a prolonged period of time. It should be noted, however, that coverage might not be perfect in practice. The user might walk out of coverage, or interference from other users or devices can occur. In these situations, a connection might be lost (time-out), or the user might not be able to connect to the network on demand. As a consequence, disconnections for short periods of time can be expected. In addition, the user might not want to be connected to the network for long periods of time, in order to save battery (or money). Like any other network, a wireless network may break down for a period of time. For mission critical applications, the user should be able to continue using the application anyway. In conclusion, even though a network connection is available, it is likely that the user will want or will be forced to operate disconnected for shorter or longer periods of time.

Wireless LAN and mobile phone modems have orders of magnitude lower bandwidth than fixed networks. It should be noted, though, that the bandwidth of wireless LAN is quite high (2 Mb/s to 11 Mb/s). New standards for mobile phones like UMTS are pushing for higher bandwidth in mobile phones too, so bandwidth will probably not be a limiting factor for many applications in the near future. It is more a question of availability.

¹⁶ Wireless LAN b offers transmission rates of 2-11 Mb/s.
¹⁷ The theoretical maximum speed of GPRS is reached using all eight timeslots at the same time without any error protection. Transmission rates are much lower in reality. Relatively high mobile data speeds may not be available to individual mobile users until Enhanced Data rates for GSM Evolution (EDGE) or Universal Mobile Telephone System (UMTS, also called 3GSM) are introduced.

2.3 Requirements to the system support

In the preceding sections, we discussed the requirements that are placed on the application because of the work process, the existing IT systems, the hospital environment, and technology restrictions. In this section, we present the requirements to the system that are set by the particular hospital application that we analyze. The application itself is not the focus of this thesis, but it is a realistic example of an application executing on hand held computers. We think that the demands that this particular application sets are general requirements for many types of practical applications using hand held computers. We especially think that the demand for disconnected operation will often arise when using hand held computers. In addition, we will generally want to offload work to servers.

When hand held computers are used in commercial systems, we can generally expect that the user demands a high degree of reliability. We cannot expect the user to be especially technically competent or interested. Updates with new software versions should, therefore, be a part of the system. Generally, we can also expect that data should be under central administration. Hand held computers used in the field are more susceptible to loss or theft than stationary computers under central administrative control. This means that we can generally not trust clients on hand held computers.

As a consequence, we see a general need for a system that supports reliable and disconnected operation on hand held computers. We also think that there is a clear indication that a client server model is a good approach for this type of system. We especially think that the requirements indicate that the system should support constructing applications as thin client applications that execute in a browser-type application on the client computer. In the analysis, we use these visions as a starting point.

Chapter 3 Structure of the analysis

In this chapter, we discuss how to make a system that has the functionality presented in chapter 2. The analysis is structured in five sections that address the problems of hand held computer limitations, disconnections, reliability, and synchronization of work in a disconnected system. We analyze the following issues:

1. Client server system as a programming model. In section 4, we analyze client server models, which are the basis for our system. We start with an analysis that clarifies the main concepts. The analysis answers why we split our model into layers. We also argue why we use an internet type client-server model.

2. Client server system for hand held computers. Because we want to make a system for hand held computers, we analyze how to support a client-server system with limited resources. In section 5, we analyze how to support an internet type client server model using hand held devices. We analyze how we can handle the processing, display and network limitations of hand held computers. In this analysis, we present the arguments for using a thin client model based on a browser as a way to handle processing limitations. This model also solves the software version problem mentioned earlier. We introduce a thin client layer on top of our general client server model. In this chapter, we also discuss techniques for adapting to specific devices when many different types of hand held computers are used. We discuss how display limitations can be handled in this context.

Finally, we discuss different techniques for handling the network limitations of hand held computers. We show that a model that includes a local proxy and an agent on the fixed network opens the possibility of using many different techniques for reducing bandwidth consumption and latency. We save the discussion of disconnections for section 6.

3. Client server systems with disconnections. In section 6, we analyze the implications of disconnections in the client server model. In this analysis, we discuss the solutions for adapting to disconnections. We argue for a simple, general local proxy solution that is limited to a cache, and for a client-server system that offers asynchronous invocation through the use of an agent. We reject the use of a local server and locally replicated server data on the hand held computer, and we analyze the implications of this choice. In the section, we also discuss how different kinds of agents can offer advanced support for disconnections. We briefly discuss whether or not applications using the system should be aware of disconnections, and how the awareness can be handled. In the section, we argue for separating the system into four layers: application layer, thin client application layer, application support layer and communication layer. We show that abstracting the communication layer not only gives us the flexibility to adapt to different protocols and types of network connections but also makes it possible to use different synchronization methods (for example hotsync, airdiscs [Jing 1999], direct TCP/IP connections).

4. Reliable client server system. In order to make a reliable system that can be used in a hospital environment, we analyze all fault types that are critical for our system model. We decide what level of reliability we want and which techniques we use. We also analyze how to handle the errors and how to overcome them. In chapter 7, we perform the analyses that answer the mentioned questions. In this analysis, we show how a delivery guarantee can be assured both for messages in the communication layer and for requests in the application support layer. We argue for splitting the system into layers as a way of introducing fault tolerance.

5. The hospital application. The work flow application is an application with disconnected operation.

In section 8, we analyze the application that we use to demonstrate our system. The application supports a work flow process in an environment with disconnections. We analyze different approaches used in work flow management systems with support for disconnections. We analyze the synchronization process when data from many disconnected sources has to be synchronized.

Chapter 4 Client server models

In this section, we clarify the concepts and analyze how we can build systems that support a client-server model. We propose to use an internet type client server model. The internet type client server model is simple and very general. It can be used to perform both reads and writes on the data objects. We split the analysis of the client server models into the following parts:

1. Basic concepts. In this section, we present the concepts of client and server, and their responsibilities. We show that client server programming can be used on different abstraction levels.

2. Simple request/reply scheme. In this section, we present how client server communication is organized. This gives a more detailed picture of the client server communication by explaining the layers in the client server model.

3. Analysis of client server layers. In this analysis, we discuss the general functionality of the layers in a client server system. We discuss the issues of distinguishing requests, addressing, and reliability.

4. Choosing client server model structure and interface. In this section, we present two different client server models: the object oriented client server model and the internet type client server model. We choose the latter as the interface in our system.

4.1 Basic concepts

The client-server model is a way to organize a distributed system. In the simplest system, one application acts as a server and one application acts as a client. More complex systems can be built in which many applications participate, each acting as both server and client. As shown in figure 4.1, even simple client server systems can become complex and cause problems. If such systems are not carefully designed, deadlocks can occur.

Figure 4.1: Client-server applications cooperating. Each application is a server and a client.

The client server model can be seen both from the perspective of a single invocation and as a model for designing a complete system. When using the client server model as a way to design systems, the idea is to structure the system into a few trusted servers that offer services to many untrusted clients. The servers often control databases and offer access to data. The client server system makes it relatively simple to synchronize access to shared data. With a limited number of servers that control the data, administration is generally simpler than in a system with more distribution of data and logic. On the other hand, many applications using one server can result in performance problems, where the server becomes the bottleneck.

As mentioned in section 2.3, we think the client server model is a good approach for system design in many commercial types of applications. We will not discuss the client server system as a model for system design in more detail. In this section, we only look at methods that support building such systems. Consequently, we only look at the client server model from the perspective of a single invocation.

The client server model is used at different levels of abstraction depending on system support. In the most abstract uses, like CORBA or RMI, the client-server model is integrated with the (object oriented) programming model, and communication-protocol issues are completely hidden from the programmer. In the least abstract uses, the programmer implements all aspects of the model, including low-level communication.

4.2 Simple request/reply scheme

Client-server protocols can be constructed in different ways. In order to get a clear picture of the components in a client server protocol, we discuss a simple request and reply scheme. A single invocation can be described by the following steps: The client sends a request message to the server in order to ask for a service (e.g. a document). The server then processes the request by executing server-side code and sends a reply message to the client. The reply message includes the requested data or an error code [Tanenbaum 1996]. The client inspects the reply message upon reception.

Figure 4.2: Model of a simple client server system. The application (Layer N+1) uses primitives offered by a system support layer (Layer N). The system support layer uses a protocol stack - the communication layer (Layer 1..N-1) - to transport bytes between the client and the server. The figure shows three basic responsibilities in three layers. In some systems the three layers are integrated, but in this section we argue that systems should be built using these three distinct layers.

In figure 4.2, we show a general model for a simple client server system. As indicated in the figure, a client server system can be separated into three vertical layers with different responsibilities: a communication layer that moves bytes between client and server (layers 1 to N-1); a client-server application support layer (N) that defines and implements the Request primitive and calls the correct layer N+1 code on the server side; and an application layer (N+1) that uses the primitives offered by the application support layer.
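
The three layers can be sketched in code. The example below is an illustration only and makes several assumptions that are not in the text: all names (`Channel`, `ServerSupport`, `ClientSupport`, `lookup_document`) are hypothetical, messages are encoded as JSON, and the communication layer is simulated by an in-process function call instead of a real protocol stack. It shows the application layer calling a `request` primitive, the support layer packing and unpacking messages and dispatching to server-side code, and the reply carrying either the requested data or an error code.

```python
import json


class Channel:
    """Communication layer (layers 1..N-1): moves bytes, knows nothing else."""

    def __init__(self, deliver):
        self.deliver = deliver            # server-side entry point for raw bytes

    def send_and_receive(self, raw):
        return self.deliver(raw)          # blocking: returns the reply bytes


class ServerSupport:
    """Layer N on the server: unpacks requests and calls layer N+1 code."""

    def __init__(self):
        self.handlers = {}

    def register(self, method, handler):
        self.handlers[method] = handler

    def handle(self, raw):
        message = json.loads(raw.decode("utf-8"))
        handler = self.handlers.get(message["method"])
        if handler is None:
            reply = {"error": "unknown method"}
        else:
            reply = {"result": handler(*message["params"])}
        return json.dumps(reply).encode("utf-8")


class ClientSupport:
    """Layer N on the client: implements the Request primitive."""

    def __init__(self, channel):
        self.channel = channel

    def request(self, method, *params):
        raw = json.dumps({"method": method, "params": list(params)}).encode("utf-8")
        reply = json.loads(self.channel.send_and_receive(raw).decode("utf-8"))
        if "error" in reply:              # the client inspects the reply
            raise RuntimeError(reply["error"])
        return reply["result"]


# Layer N+1 on the server: the actual service (hypothetical example).
def lookup_document(name):
    return "contents of " + name


server = ServerSupport()
server.register("get", lookup_document)
client = ClientSupport(Channel(server.handle))

# Layer N+1 on the client: uses only the Request primitive.
print(client.request("get", "report.txt"))
```

Because the application code only sees `request`, the `Channel` in this sketch could be replaced by another transport without touching the application layer, which is the flexibility that the strict separation into layers is meant to give.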

In many systems the layers are not clearly separated. To make the discussion clear, we separate the layers strictly. In addition, separation into layers gives a more flexible design.

A synchronous system functions as follows (in chapter 6 we will show an asynchronous system):

1. Layer N+1 issues a blocking request call, using the Request service offered by layer N.
2. In layer N, the request method and its parameters are packed into a message in accordance with the client server protocol definition. The message is sent to the server using a blocking send primitive from layer N-1.
3. The server has previously called a blocking receive in layer N and is listening for incoming request messages, using a receive service offered by layer N-1.
4. When a request message is received, the message is unpacked, and the request method and its parameters are identified.
5. The particular server code that should handle the request is identified and called. A new thread is typically created to handle the request.
6. When the server has handled the request, the response is received as a return parameter in layer N. The response is sent to the client using a send primitive from layer N-1 on the server.
7. The response is received in layer N on the client. Layer N unpacks the response and returns it to layer N+1.

Notice that each layer offers services to the layer above it by offering different primitives.

4.3 Analysis of client server layers

When analyzing client server layers, we are not interested in the specific client and server application logic, but only in the primitives that the system offers to the programmer. In the following, we will only discuss the two bottom layers:

1. Communication layer
2. Application support layer

Communication layer

When designing a client server support system, we have to choose the underlying communication protocol stack. For efficiency, the protocol stack can be reduced to include only the physical and data-link layers. A larger protocol stack that includes the transport layer (UDP or TCP) can also be used. The protocol is often connectionless, but it can also be connection-based (e.g. socket programming with TCP).

The characteristics of the underlying communication layer influence the design of the client server protocol. If the communication layer supports reliable communication, it is easier to construct reliable client server protocols. In the following, we will assume that the underlying communication layer offers reliable communication.

The communication layer in general offers a send and a receive primitive. In a synchronous system these primitives are blocking; in an asynchronous system the primitives return control to the application immediately after being issued. Usually synchronous communication is used, because programming with synchronous primitives is generally considered easier.

Application support layer

The application support layer offers primitives to the application layer and uses the communication layer for transporting data. A number of issues have to be addressed when implementing the application support layer. We will only discuss issues that have direct influence on our system. The issues are:

1. Definition of primitives. Part of designing a client server system is to define communication primitives. This section presents which primitives are usually used in client server systems and how they are distinguished.
2. Addressing. The way of addressing different clients and servers must be chosen. We will present the different ways of addressing.
3. Level of reliability. The client server support system may handle some errors. In this section we will present which errors can be handled and how.

Definition of primitives

One of the most important tasks when designing a client server system is to decide what kind of primitives we want to support.

A number of primitives are typically supported in client server systems¹, but for clarity we will only discuss a request primitive. Request primitives are issued by the client applications. Servers typically offer more than one service, and clients have to be able to differentiate between the services. This problem can be solved in two ways:

1. By supporting more than one request primitive. Example: request1, request2, ...
2. By parameterizing a single request primitive. Example: request1(parameter1), request1(parameter2), ...

Note that a parameter may encapsulate more than one parameter. Of course, the methods can be combined, so that we use several request types that are each parameterized. If there are many types of requests, it will in general be a good idea to define a few request primitives and parameterize these to distinguish between them. As we will show later, we use this method in our client server system.

Addressing the clients and servers

The client and server should be able to identify each other.² The addressing format is determined in part by the communication layer addressing and in part by a choice of how to address applications in the system. The client must have a way to obtain the address of the server. There are three different addressing schemes used in request-reply protocols [Tanenbaum 1995]:

1. Machine addresses are written directly in the program code (e.g. IP addresses).
2. Processes discover each other by broadcast.
3. Processes know each other's names and look up addresses in a naming service.

In small static systems, the first two are usable. For large systems, the naming service approach is typically used.

¹ Request (REQ), used on the client side; Reply (REP), used on the server side; Acknowledge (ACK); Are You Alive (AYA), has the server crashed?; I Am Alive (IAA), the server has not crashed; Try Again (TA), the server has no room; Address Unknown (AU), no process is using this address.
² At least the client should be able to identify the server. The server may be passive and does not need information about clients.
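As a rough illustration of a single parameterized request primitive combined with the naming service approach, consider the following Python sketch. The `Request` and `NameService` classes, the service names and the address are hypothetical and serve only as an example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """A single request primitive, distinguished by its parameters."""
    service: str      # which server-side service is wanted
    arguments: tuple  # encapsulated parameters for that service


class NameService:
    """Hypothetical naming service mapping server names to addresses."""

    def __init__(self):
        self._table = {}

    def register(self, name, host, port):
        self._table[name] = (host, port)

    def lookup(self, name):
        try:
            return self._table[name]
        except KeyError:
            raise LookupError("address unknown: " + name) from None


names = NameService()
names.register("lab-results", "10.0.0.5", 8080)  # illustrative address only

address = names.lookup("lab-results")
request = Request(service="get_result", arguments=("patient-42",))
```

The client code names the server symbolically and never hard-codes the machine address, which is what makes this scheme suitable for larger systems.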

Reliability in client server systems

Reliability involves an analysis of how errors are detected, handled and survived. At the most basic level, reliability is introduced by using acknowledgements at some level in the client server protocol stack. When the client issues a request, it generally waits a predefined amount of time. If the response has not returned within this time, the client considers this an error.

Where in the protocol stack the acknowledgement times out depends on the particular system. If layers 1..N-1 use unreliable protocols, some sort of acknowledgement has to be used in the client server protocol in order to ensure reliability of the system.³ However, if a reliable protocol (like TCP/IP) is used under the client server protocol, explicit acknowledgements may not be necessary. If the response is not received within a certain amount of time, the underlying reliable protocol will time out.

The important point is that at some level in the protocol stack a timeout will occur if the request or reply messages are lost. This timeout can be used to indicate an error to the application using the client server system.

Summary - Analysis of client server layers

A client-server support system can be divided into a communication layer and an application support layer. The communication layer is used to transport data in the system. It can offer synchronous or asynchronous communication primitives. The application support layer uses the communication layer to transport data and usually offers some request primitives. The request primitives can be distinguished by request types and by request parameters. Typically a few parameterized request types are used.

4.4 Choosing client server model structure and interface

There are many different ways to organize a client server support system. In order to choose a model we look at two different models (presented in figure 4.3):

³ In the simplest systems the return data is seen as an acknowledgement (ACK). If the request could not be serviced, an error code is returned. More complex systems might use acknowledgement messages for every communication step. A compromise solution is to set a timer in the server upon reception of a request. If treatment of the request takes a long time, an ACK is sent back to the client. An ACK can be used either on each packet or on whole messages.

1. Object oriented client server model.
2. Internet type client server model.

Figure 4.3: Client server models. (a) An object oriented model. Requests go through object interfaces. The programmer just issues requests and does not need to know where the objects are placed (issuing A.methodA and C.methodA is syntactically identical even though the objects are placed in different places). (b) Internet type model. Every request to the server is issued using an interface offered by the server.

Object oriented client server model

The object oriented client server model (OOM) (figure 4.3, case (a)) is an abstraction of a client server model into an object oriented programming model. The application is not aware of the communication going on behind the application support layer. This model is used in a number of systems that have tried to relieve the programmer of the burden of dealing with communication issues. In addition, there has been a movement towards integrating the client server model into programming languages, so that the programmer can use the same syntax to access objects situated on the server.⁴

In contrast to the internet type client server model, the major characteristic of OOM is that methods are accessed through objects. The requests that the client can make on objects (on a server) are defined in the object interfaces (typically defined in a special interface definition language). This model gives a very flexible and stringent way of dealing with requests. Invoking object methods on the server is handled by sending a set of requests to the server, which executes the object methods. The object distribution, i.e. addressing, naming and communication, is handled by special processes (for example brokers in CORBA). These processes are placed in the application support layer.

The object oriented models (like RMI) are very complex. Keeping in mind that we later want to extend the model with reliability and support for disconnected operation, where we might have to synchronize changes on locally downloaded objects with remote copies, we can expect that it will get even more complex. Hence, we think that the OOM is too complex for the resource weak devices that are our focus.

Internet type model

Another approach is to use an internet type model (figure 4.3, case (b)). The model is very simple, flexible, and widely used. It is typically used in the form of the http-request model and is often identified with it. However, it should be stressed that the internet model is more general and is not limited to any specific protocol. A number of system tools and libraries are available for this model and ease programming with it. Using internet programming libraries on the client side and servlets, CGI scripts or PHP on the server side, the programmer does not have to worry about communication issues.

The main characteristic of the model is that it uses a set of request primitives that are defined by the server. The programmer using the model is aware of client server communication. The addressing is performed explicitly (using URLs)⁵ and requests include method and parameters.

⁴ The basic thought behind RPC systems, RMI [Niemeyer 1997] and CORBA [Siegel 1999] is that remote resources should be handled with the same programming language syntax as local resources.
From the application point of view, the request is addressed to a server that will interpret the request, perform a defined action, and return the result. The types of requests can be very complex, and it is difficult to foresee which kinds of requests programmers will need. The solution has been to define a few general requests that can be parameterized by the programmer. Therefore, we can reduce the number of requests in the internet type request reply model to three: GET, POST and PUT.⁶

Because of the simplicity and flexibility of the internet type model, we chose this model as the interface for our client server support system.

4.5 Choosing client server model structure and interface - conclusion

Summarizing, we chose the internet type model as the interface for our system. It can model the same functionality as the object oriented model. By giving the user the responsibility for performing requests, the model stays very simple and does not require as much background processing as the object oriented model (for example ORBs in CORBA).

The internet type client server model is very general and to a large extent relieves the programmer of low level communication issues. Furthermore, the development of different technologies (servlets, ASP, JSP) has made the task of programming client server systems less complex, since the application support is already programmed. On the WWW, URLs are used as the addressing method for internet type client server models. We will use the same method.

4.6 Summary - Client server models

We think that the client server model is a good approach in many practical applications, including the work-flow application which we want to build. In order to support the client-server model, we have examined in detail how to support a single invocation.

Depending on the application support, the client server model can be used at different levels of abstraction: from the most abstract, where communication is hidden from users (CORBA, RMI), to the least abstract, where the programmer must implement all communication aspects. The typical client server model is divided into an application layer, an application support layer and a communication layer. The upper layers use the functionality of the underlying layers.

We have chosen an internet type client server model that uses URL⁷ addressing. The http-protocol is an example of the internet type client server model. The internet type client server model is characterized by having a limited number of parameterized request types that are handled by the server. Most systems using http support only GET and POST requests.

⁵ URL is presented in appendix A.3.
⁶ In HTTP a number of requests are used:
PUT: Sends complete data to be stored on the server. It is rarely supported on web servers today.
POST: Sends data to the server for processing. It is used to submit form results or append data to the server, and is also commonly used for updating documents.
GET: Fetches an object (typically an HTML document). It is most commonly used to fetch documents.
DELETE: Deletes a resource on the server.
HEAD: Fetches the header for a document. It is used to query the server.
POST and GET are the most used requests and are implemented by most HTTP servers.
⁷ URL is explained in appendix A.3.
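To make the idea of a few parameterized, URL-addressed request types concrete, the following Python sketch builds one GET and one POST request with the standard `urllib` library. It only constructs the request objects and does not contact any server; the host name, path and parameter names are placeholders, not part of the system described here.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_get(base_url, **params):
    """GET: the parameters travel in the query string of the URL."""
    return Request(base_url + "?" + urlencode(params), method="GET")


def build_post(base_url, **params):
    """POST: the parameters travel in the message body."""
    body = urlencode(params).encode("ascii")
    return Request(base_url, data=body, method="POST")


# The host name and path below are placeholders for illustration.
get_request = build_get("http://server.example/forms", form="bloodgas", ward="7")
post_request = build_post("http://server.example/forms", form="bloodgas", ph="7.4")
```

Both requests address the same server-side resource; only the request type and the parameters distinguish what the server is asked to do.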

Chapter 5

Client server system for hand held computers

In this chapter, we will expand our analysis of the client server model by analyzing the implications of using hand held computers. We will look at different theories for handling resource weak devices. We will choose a browser solution to reduce the load on the hand held device, and we choose to let the server adapt to different types of devices. Additionally, we will choose a proxy-agent design to handle the general network limitations of the hand held computer. We will use the design to multiplex multiple requests through a single connection. We will support neither compression nor differencing of data, because they load the hand held computer with unnecessary processing.

We will structure this chapter in the following sections:

1. Processing and memory limitations. Reducing processing is mainly about moving as much of the processing as possible from the client to a stronger server. In this way, the resource weak computer can be relieved. We will analyze thin client models and argue for building a browser type thin client model on top of our general client server system.

2. Device limitations. There are many types of hardware platforms for hand held computers.¹ In systems whose clients have different display capabilities (for example screen size), we often have to filter and transform data from a server to fit the particular clients. In this way, we can reduce the demands for processing, networking, and display resources and adapt to what the particular device can handle. Data transformation typically includes resizing pictures, formatting text and reducing colors. In this section we will discuss how device specific adaptation can be supported in our model.

3. Network limitations. The limited network resources available to most hand held computers demand special attention and are discussed in section 5.3. Later, in chapter 6, we will discuss how to handle disconnection, but in this section we will focus on problems that are present even when we are connected. In this analysis, we will show that it is preferable to multiplex communication over a single (socket) connection instead of opening a separate connection for each request between the hand held computer and the fixed network. Additionally, we will show that a proxy-agent design can be used to provide bandwidth reduction facilities in general.

¹ In addition, there are three dominating operating systems: Windows CE, PalmOS and EPOC.

5.1 Processing and memory reduction

In this section we will analyze techniques to offload computing tasks from clients running on a hand held computer. When using hand held computers as clients, we face a dilemma between off-loading work and working independently:

Because of the scarce processing resources, we want the server to perform as much work for the client as possible. This requires more communication between the client and the server.

On the other hand, hand held computers tend to be relatively weak with respect to bandwidth and stability of network connections, which leads to disconnected behavior. This is an argument for letting the client work autonomously. In order to do that, we have to place a bigger part of the processing on the client.

Client-server models can be classified by the amount of work the client computer off-loads to the server computer. The extremes in this classification are the thin and the thick client. We chose a model that combines them:

1. Thick client model. In thick client systems the client performs a considerable amount of computing and only uses the central server(s) sporadically. Server data is often distributed (replicated) to the client. These systems are generally used to off-load servers and make client server systems scalable [Howard 1988].

Thick client systems place requirements on the resources of the client computer.

2. Thin client model. Thin client systems are more interesting from the perspective of hand held computers. In its pure form, the idea is that the client should function as a terminal. The application is executed completely on the server, and only screen updates are sent to the client. Mouse and keyboard actions are sent to the server. The server is the only execution unit, while the client is just the user interface. By using a light protocol and only sending updates, the use of bandwidth is very limited. CITRIX Corporation claims that 10 kb/s is adequate [CITRIX 2001].

CITRIX Corporation has developed a thin client architecture that supports a variety of platforms including PDAs. A server controlling the applications is executed on a desktop machine (running Windows NT) and communicates with the thin clients executing on remote computers using the Independent Computing Architecture (ICA) protocol.

Another example of a thin client is virtual network computing (VNC). VNC systems usually have a process that executes on the server. The process sends screen updates to the client and fetches user actions from the VNC terminal.

The basic problem with these types of thin clients is that disconnections and high network latency are detrimental to them.

3. Browser model. The concept of thin clients has changed quietly in recent years. The development of the WWW has resulted in a new kind of thin client: the web browser. Using a browser, a client can download objects such as web pages stored or generated by a server. The client can then initiate requests by following links or submitting forms. In this kind of computing the application is, for the most part, executing on the server. Some computing takes place in the browser, though. To what extent the application executes on the server or the client can be controlled in detail, as it is possible to download executable objects (e.g. scripting languages in web pages and links to Java Applets). In this context, the notion of a thin client has been difficult to define, because it is rather subjective how large a portion of the application should be executed on the server before we call the client a thin client. By moving code to the client, we move further away from the thin client idea, but make the client more autonomous.

The browser type thin client has an additional advantage. By using it to transport and execute mobile code, we are able to make applications that have more functionality than forms.² This puts some additional requirements on the hand held device, though.

We think that a browser solution is a good general solution for building thin client applications. The interface for our client server system that we chose earlier can also be used in a straightforward way in a browser. As a consequence, we chose that our design should contain a browser solution. On the other hand, we think that a browser solution also restricts the types of applications that can be programmed. Therefore, we chose to offer the programmer not just a thin client server model but also the general client server system. We will not integrate the browser with the client server support system, but let the browser be an application that uses our general client server support system.

5.2 Device-specific adaptation

In this section, we will analyze different adaptation techniques, focusing on handling the diversity of device limitations. There are two aspects to consider in connection with handling display limitations in a device specific way:

1. Placement of adaptation logic.
2. General techniques for adapting to different types of clients.

Placement of adaptation logic

As shown in figure 5.1, the adaptation can be placed on the client, on the server, or between them in a separate process. The latter is called an agent in this thesis. We will now describe the three techniques and their advantages:

1. Client-based adaptation. In client-based adaptation, the client has the complete responsibility for adapting data from the server to its capabilities. Taking a browser application as an example, the client must filter the web page received from the server. The browser can choose not to show data such as graphics, video clips or applets that consume resources.

² For example by using Java Applets.

Figure 5.1: Adaptation methods. (a) Client based adaptation. The client takes care of processing and presenting data in a suitable way. (b) Server based adaptation. The server generates responses that suit different clients. (c) Proxy based adaptation. A proxy distils responses so that they suit the clients.

Alternatively, the client can choose to transform the data into a form that can be shown. For example, the browser could downsize graphics or reject special fonts in text. It would be most logical if the browser application did the filtering or transformation processing, but it could also be handled by the application support layer if it understands application logic (a proxy server).

The advantage of client-based adaptation is that the server is unaffected. This solution scales well and can handle many types of devices with different display capabilities. The disadvantage of the method is that it places a processing load on the client.

2. Server-based adaptation. In server-based adaptation the server is responsible for executing client requests and sending replies that suit the clients. The server has to detect what kind of device is contacting it. Either the client has to describe its capabilities in the request, or the server has to identify the client (for example by its ID) and look up its capabilities in a repository. The advantage of the method is that the clients are relieved of all processing. The disadvantage is that the server has to contain logic for handling all types of clients. If the server is out of our control, this method cannot be used.

3. Agent-based adaptation. The agent-based approach uses a separate agent that optimizes communication and formats the data so that it suits the hand held computer. Generally, the agent hides a thin client from the server, which thinks that it communicates with a normal client. This solution is extensively documented in the literature ([Fox 1998], [Kunz 1999]) and is also used in many practical systems.³ In the agent solution the server functionality is unchanged, and still different clients can be handled. The responsibility for adapting server responses to the different clients is taken over by an agent. Advanced adaptation agents can be constructed that offer many services such as data transformation, aggregation, customization and caching ([Fox 1997], [Fox 1998]). For advanced functions, the agent should not only know the capabilities of the hand held computer but also be able to understand content and sometimes application logic.

The disadvantage is that the agent has to be able to understand the data. This means that some application logic has to be placed on the agent. In the case of web applications the agent logic can be described well and the agent solution is manageable. Additionally, the communication is delayed by being sent through one extra node, which increases the latency.

It is possible to mix the approaches. As we will show in section 5.3, we will choose a proxy-agent design in order to be able to multiplex many requests onto one connection. In chapter 6 we use the same agent to transform between the asynchronous client server model on the hand held client and a synchronous model on the server. It would be logical to extend this agent with adaptation capabilities for different types of devices. However, it should be noted that the adaptation process has to understand the content and sometimes even application logic. The agents we have just mentioned reside in the application support layer. This means that we should either add the necessary logic to understand the content that is transported, or use a server-based adaptation process. Making a correct transformation agent is a complex task. We chose the server-based adaptation process for convenience.

³ There exist numerous systems that function as WAP gateways. The gateways transform HTML to XML that can be viewed by WAP browsers. An example of such a system is Babelserver, which allows users to view standard web pages on a WAP browser. Documentation about Babelserver can be found at

General techniques for adaptation

Adapting server responses to the limited device capabilities in general involves two aspects:

1. Identifying client capabilities
2. Transforming the server response

Identifying client capabilities

In server-based and proxy-based adaptation, the adaptation process needs to know the client's capabilities (and limits). If the adaptation takes place on the client, the client does not need to be identified. The clients can be identified either by including capability information in the request or by identifying the client type in the request. With the first method, we risk that the request message becomes very big in order to describe all the necessary properties of the client. With the latter method, the adaptation process needs to have a repository of client device capabilities (for example screen size, color and operating system specific details). On the WWW, browsers use the first method, i.e. they send information about client capabilities in requests. We chose that the request should only identify the client type.
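A minimal Python sketch of this choice could look as follows, assuming a small in-memory repository keyed on the client type. The device names, capability fields and the fallback rule are illustrative assumptions and are not taken from the prototype.

```python
# Hypothetical capability repository; device names and values are illustrative.
CAPABILITIES = {
    "workstation": {"markup": "html", "images": True, "width": 1024},
    "pda-small": {"markup": "text", "images": False, "width": 160},
}
DEFAULT_TYPE = "pda-small"  # assume the most limited device for unknown types


def capabilities_for(request):
    """Look up the capabilities from the client type named in the request."""
    client_type = request.get("client_type", DEFAULT_TYPE)
    return CAPABILITIES.get(client_type, CAPABILITIES[DEFAULT_TYPE])


def adapt(response, request):
    """Drop content that the requesting device cannot display."""
    caps = capabilities_for(request)
    adapted = {"markup": caps["markup"], "text": response["text"]}
    if caps["images"]:
        adapted["images"] = list(response.get("images", []))
    return adapted


response = {"text": "pH 7.40", "images": ["trend.png"]}
for_pda = adapt(response, {"client_type": "pda-small"})
for_desk = adapt(response, {"client_type": "workstation"})
```

The request carries only a short type identifier, so its size does not grow with the number of capabilities the server needs to know about.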
The number of hand held computers that can be used with the system is not so big so the capability database will be relatively small Transformation of server response The server response has to be transformed to fit the clients capabilities. Taking a web-application as an example, a server response encoded in XML

55 5.2 DEVICE-SPECIFIC ADAPTATION 43 could be transformed into other encoding types. 4 For normal workstation client, HTML syntax might be used. For hand held devices, a syntax with limited functionality could be used. An example of such syntax limitations are: minihtml, text, minixml, WML. Even within the same page description format, the presentation capabilities of different devices can differ. Therefore, apart from adapting forms to the format accepted by hand held devices, we need to change some of the data displayed. The commonly used technique for adaptation is distillation. covers compression and filtering technics. Distillation In filtering, data that can not be displayed is removed. This could be videoclips or graphics. Losy compression can also be used [Fox 1998]. Graphic solution could be reduced, color pictures could be transformed to black-and-white. Knowing the limits of a browser that runs on a hand held computer, we can decide how to filter and compress data. Distillation can also be connected with refinement. Refinement is a process opposite to distillation. An example of refinement is a picture that gets reduced. By clicking on it we may request a certain part to be more detailed. In our case, we can control the server implementation. Therefore, we chose a solution where the server returns data in a device and protocol independent way. A special module is given the responsibility to encode the response that fits the requesting client Summary - Device-specific adaptation There are many types of hand held computers with different capabilities. In this section we chose to let the server adapt its responses to the client device. 4 For example to WML and HTML by using XTS 5 The most solutions, use a protocol specific distillation. The protocol specific distillation can adapt dynamically to environmental changes. En example can be a streamed video, that gets reduced in quality, when bandwidth gets limited. 
Protocol specific distillation can use information about the used data, to optimized its transportation: we can allow a loss of data using audio streaming, we can not do it transporting transaction data. Distillation often bases on data type diversity (presented by Odyssey project [Noble 1999]). Data have different characteristics (video uses two qualities: frame rate and quality, picture use: color, resolution and size). These characteristics are used to make more efficient distillation, that affects the service quality minimally. In order to support dynamically adjustment to the network environment, there is a need for a service that measures the network performance. BARWAN project introduces SPAND (Shared Passive Network Performance Discovery) that takes care of that. For details look at [Brewer 1998].

56 5.3 NETWORK LIMITATIONS 44 Alternatively we could have used an agent to transform a standard response from the server or we could have let it be up to the client to dentil the response. The later is not useful when the client is a hand held computer. We decided against the agent-solution because we do not want to place the application logic necessary to distil responses outside the server. 5.3 Network limitations Network limitations present a serious problem in client-server systems. These problems are especially clear when a thin client model is used. Even though we chose a browser model where we do not have to communicate continuously we still have to communicate with a central server from time to time. In this section we discuss how to handle network limitations of hand held computers while we are connected. In chapter 6 we will go a step further and handle disconnections. To find the solution for the network limitations, we will structure the section in three parts: 1. Network adaptation problems. Firstly, we present what network adaptation problems that we face when using hand held computers. We will answer what network adaptation is, and what part of it is important. 2. General design Secondly, we present a general model that can be used for handle network limitations. In order to run adaptation processes, we generally need a special functionalities on both sides of the connection. 3. Adaptation methods. Secondly, we present methods that can be used in the mentioned model to adapt hand held devices to network limitations Network adaptation problems The problems of network deficiency can be classified into limited bandwidth, latency, stability and variability: Limited bandwidth. As mentioned in 2.2, hand held computers typically have orders of magnitude lower bandwidth to their deposal in comparison to stationary computers. This means that it might be necessary

57 5.3 NETWORK LIMITATIONS 45 to use technics like compression or caching to minimize bandwidth consumption It should be noted that the bandwidth if often quite high in absolute terms (Wireless LAN: Often 1-11 Mb/s). When reducing bandwidth consumption, we often make a trade-off with processing on the hand held computer. Therefor, one should carefully consider whether the applications really need more bandwidth than offered. Latency is often a more serious problem than bandwidth [Chang 1998]. If many connections are made from one hand held computer and the connections are unstable, then delays in applications can occur. Since opening sockets and accepting replies is a time consuming task, especially in case of often disconnections, a solution for this problem is based on multiplexing many connection in one. Alternatively, selecting a protocol that suits the physical media also can help in reducing latency (by avoiding extensive handshaking). Stability. When hand held computers use wireless network connections, problems with stability have to be considered. We will discuss this problem in detail in 6. Variability addresses the issue of changing network environment. When a hand held computer uses wireless network connections, the variability of the service might have to be considered. For example, if a number of network users fluctuates, the available bandwidth will also change. Sometimes, we need a continuous service. This implies that we can either chose to use the lowest bandwidth all the time, or that we will use technics for levelling out the variability (for example by using buffers) General design In this section we present a general design that makes it possible to use many different methods for handling network limitations. The general design builds on the conclusion that an adaptation processes generally need functionalities on both sides of the connection. 
The design has been widely used, but was used in an especially clear way by [Housel 1996] for handling network limitations when web-browsing over a wireless network connection. Figure 5.2 shows a model of the design. The design consists of the application, a proxy, an agent, and a server. The communication is handled by the proxy and the agent. The application executes requests that are handled by the proxy.

Figure 5.2: General model. The general model is structured in an application, a proxy, an agent and a server. The proxy and the agent handle the communication and can use different techniques to handle the network limitations of the hand held computer.

The application can execute requests directly against the proxy, but sometimes we want the proxy to be transparent to the application, for instance when we do not want to change the application in any way. In that case, the proxy-agent system tunnels requests and we can use an intercept model. Requests issued by the application can be intercepted by the operating system or by a loop-back mechanism.

Figure 5.3: Intercepting proxy-agent design. The proxy and the agent are processes separate from the application and the server. The proxy and the agent handle communication between the hand held computer and the fixed network.

The loop-back mechanism is very flexible and widely used [Housel 1996], [Seybold 2001]. It can be implemented by having a local proxy listen on a local port (figure 5.3) and by having the application direct its calls to this port instead of the server address.
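As a rough illustration of the loop-back mechanism, the following Python sketch starts a local proxy on a loop-back port and lets the application direct a call to that port instead of to a server address. It is not the implementation used in this thesis. The names `start_loopback_proxy`, `application_call` and the `forward` callback are hypothetical; in a real system `forward` would hand the request to the agent on the fixed network, and a request would not be assumed to fit in a single `recv` call.

```python
import socket
import threading


def start_loopback_proxy(forward):
    """Listen on a free loop-back port and relay each request through `forward`.

    `forward` is a hypothetical callback (bytes -> bytes) standing in for the
    proxy-agent link. Returns the port the application should connect to.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen()
    port = listener.getsockname()[1]

    def serve():
        while True:
            conn, _ = listener.accept()
            with conn:
                # Assumption: the whole request arrives in one small chunk.
                request = conn.recv(65536)
                if request:
                    conn.sendall(forward(request))

    threading.Thread(target=serve, daemon=True).start()
    return port


def application_call(port, request):
    """What the unchanged application does: talk to the local port, not the server."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as sock:
        sock.sendall(request)
        sock.shutdown(socket.SHUT_WR)  # signal end of request
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks)
```

The application only needs to know the local port, so the proxy is free to cache, multiplex or queue requests behind it without the application noticing.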

In the following, we will show how a number of techniques can be used in this model.

Adaptation methods

In this section we describe different techniques for handling some of the problems mentioned earlier. We describe how the general design mentioned above can implement the techniques.

Caching. A client that often requests the same data does not need to load the network with the data every time. A copy of the response to a previous request can be used instead, reducing the load on the connection [Chang 1998]. Caching sets higher requirements for the memory of the hand held computer. Caching also introduces some problems: there should be a method that ensures the validity of data. In addition, if we allow write access to cached objects, or if data are accessed by multiple clients concurrently, data consistency has to be handled. If there is a high degree of locality in requests, caching can limit bandwidth consumption and latency. By scheduling pre-fetching of data to times with high bandwidth and when the application is not active, network limitations in general can be masked by trading off memory. A cache placed under the control of the proxy can thus be used to reduce network traffic. In chapter 6 we will show how this technique can be used to handle disconnections.

Differencing. Differencing is a process that sends only updates of objects instead of whole objects. The idea is to cache a common base object on both sides of the communication ([Housel 1996], [AirAcesss 1994]). The sender computes the difference between the new object and the base object held on both the sender and the receiver side, and sends only the changes. The receiver uses the received differences to compute the new object. Differences can be computed on the binary level [Coppieters 1995], but application knowledge makes differencing easier. Notice that differencing and caching can be used together to reduce bandwidth consumption. In our system we will not use binary differencing.
Early in the project we made some informal experiments and came to the conclusion that constructing objects on the hand held computer from a base object and a difference is too demanding. Constructing a binary difference file is even more demanding and is not a realistic approach for hand held computers.
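To make the differencing idea concrete, the following Python sketch computes and applies a difference between a shared base object and a new version, with objects modelled as simple key-value records. This is an illustration under that assumption only, not the binary differencing of [Coppieters 1995] and not code from our system; the function names and the field names in the example are hypothetical.

```python
def compute_difference(base, new):
    """Sender side: describe `new` relative to the base object both sides hold."""
    changed = {key: value for key, value in new.items()
               if key not in base or base[key] != value}
    removed = [key for key in base if key not in new]
    return {"changed": changed, "removed": removed}


def apply_difference(base, diff):
    """Receiver side: rebuild the new object from the cached base and the difference."""
    result = {key: value for key, value in base.items()
              if key not in diff["removed"]}
    result.update(diff["changed"])
    return result


# Hypothetical example record; only the changed fields travel over the link.
base = {"patient": "P-17", "ph": "7.40", "po2": "12.1", "note": "draft"}
new = {"patient": "P-17", "ph": "7.38", "po2": "12.1", "operator": "RB"}
diff = compute_difference(base, new)
rebuilt = apply_difference(base, diff)
```

The receiver still has to do the reconstruction work, which is the processing cost weighed against the bandwidth saving in the discussion above.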

In our system we use differencing on an application level. When submitting a form in our web browser, we generally do not POST the complete web page to the server but only the entered data. The server application logic knows how to interpret the data.

Compression. The data that are sent may be encoded in an inefficient way. This is typical for protocols like HTTP. HTTP protocol data can easily be compressed using simple bit-level or protocol-specific compressors, and a significant reduction of data may be achieved. In the general model, compression requires that the proper compression and decompression algorithms (like zip) are used in the proxy on the hand held computer and in the agent. This might be computationally very demanding. We trade off processor time against bandwidth consumption. We generally think processing power is more scarce than bandwidth and have consequently not used these techniques (see footnote 6).

Protocol reduction. Running protocols designed for the fixed network often includes extra overhead. When many protocol connections are kept on a communication line with low bandwidth, the unnecessary overhead grows. One way of reducing the protocol overhead is to tunnel many connections through a single one ([Seybold 2001], [Housel 1996]). The protocol used to tunnel data can be selected to be optimal for the particular physical media. This solution requires a packing and unpacking engine on both sides of the communication tunnel. This engine can also be used to apply differencing or compression methods. Protocol reduction can reduce bandwidth consumption and latency by amortizing the communication cost over a number of connections. This solution is clearly usable in the general design. In our system we multiplex multiple connections onto a single connection between the hand held computer and the fixed network.
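The packing and unpacking engine needed for such a tunnel can be sketched as follows. The frame layout (a 2-byte channel identifier followed by a 4-byte payload length) and the function names are assumptions chosen for this illustration, not the wire format of our implementation.

```python
import struct

# Assumed frame header: channel id (unsigned 16 bit), payload length (unsigned 32 bit),
# both in network byte order.
HEADER = struct.Struct("!HI")


def pack_frame(channel_id, payload):
    """Packing side: tag a payload with the logical connection it belongs to."""
    return HEADER.pack(channel_id, len(payload)) + payload


def unpack_frames(stream):
    """Unpacking side: split received bytes into complete (channel_id, payload) frames.

    Returns the complete frames and the leftover bytes of a frame that has not
    fully arrived yet, so the caller can prepend them to the next chunk.
    """
    frames = []
    offset = 0
    while len(stream) - offset >= HEADER.size:
        channel_id, length = HEADER.unpack_from(stream, offset)
        end = offset + HEADER.size + length
        if end > len(stream):
            break  # payload incomplete; wait for more data
        frames.append((channel_id, stream[offset + HEADER.size:end]))
        offset = end
    return frames, stream[offset:]


# Two logical connections share one physical link.
first = pack_frame(1, b"GET /form/7")
second = pack_frame(2, b"POST /result")
frames, rest = unpack_frames(first + second)
```

Because every frame carries its channel identifier, the single connection is opened once and its set-up cost is shared by all logical connections.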
Header reduction. When running stateless protocols like HTTP, the header describing the connection, browser capabilities, or other information is sent every time. This header can be reduced by keeping session information on both sides of a connection tunnel (as in the previous example) and sending the header only the first time. After that, the client does not need to send the header. The on-line part of the connection adds the header again and executes the requests [Housel 1996]. Header reduction reduces bandwidth consumption.

Footnote 6: zip algorithms are available as standard Java libraries.

We have chosen not to support this kind of technique even though it can easily be implemented in the general model.

Incremental results. By receiving data incrementally, we can ensure responsiveness of a system even though the available bandwidth drops suddenly [Forman 1999]. A browser that shows a picture in increasing quality is an example. This technique can be used to handle variability in bandwidth. We have chosen not to use this technique even though we have built a browser solution on top of our general client server system.

Summary - Network adaptation

Summarizing, because we find the problem of latency to be the most important, we chose to use caching and protocol reduction by multiplexing multiple connections onto one connection. We will generally not trade processor power for bandwidth. We have examined differencing techniques but have come to the conclusion that the trade-off with processing cannot be justified on a hand held computer. Likewise, we will not use compression techniques (zip) even though there are many libraries available.

5.4 Summary - Client server system for hand held computers

Client server systems for hand held computers can be built as thin client systems or thick client systems. In thin client systems only screen updates are transported to the client and all interaction is sent to the server. The thin client is very sensitive to network latencies, because every user action is sent to the server. In the thick client the whole application executes locally and only makes simple requests to the server. The thick client places high requirements on the hand held computer's resources. We have chosen a browser-type thin client solution, which can be seen as a solution between the thin and the thick client. A browser (for example a web browser) is not as dependent on instant communication as the pure thin client is.
At the same time, the browser is not as demanding on the hand held computer's resources as the thick client. Because browsers can execute on different hand held computer platforms, there is a need to adapt the responses to the specific browser's environment. The adaptation can take place on the client, on the server, or on an agent.

Due to the limitations of hand held computers, it is preferable that the adaptation is not executed on the hand held computer. The agent solution is too complex. Consequently, we have chosen to let the server adapt its responses to the specific devices.

In order to handle the general network limitations of the hand held computer we use a design based on a proxy and an agent that are responsible for the communication between the fixed network side and the hand held computer. This design is very general and can be used to implement many different techniques. We only use this design to multiplex requests through a single connection. In addition, we chose to use caching to reduce network latency. Exactly how this cache should be designed is discussed in the next chapter.

Chapter 6

Client server model with disconnections

In the discussion of client server systems up to now, we have generally assumed that the client and server are connected for the duration of the request. In the internet type client server protocol, using HTTP requests, the requests are synchronous. As discussed in section 2.2, we can expect disconnections when using hand held computers. In this chapter, we discuss how we can expand our system so that it can handle disconnections. We look in the literature for theories that can be used to support disconnections. The solution will be chosen with respect to the limited resources of hand held computers. We structure the analysis in the following sections:

1. Important issues concerning disconnections. We start by defining what a disconnection is and how it can be handled in general. We briefly discuss how the application's demands affect the solution to the disconnection problem.
2. Analysis of overall approach. In this section, we discuss the advantages and disadvantages of different approaches for handling disconnections found in the literature. At the end of the section, we propose a solution based on a general proxy on the hand held computer and protocol-specific agents on the fixed network.
3. Analysis of the proxy. In this section we go into the details of the analysis of the proxy. We will introduce a cache, which will support general reads and writes

of objects. We will show different behaviors for dealing with simple object servers and with complex servers. We will propose a general proxy which is very simple and can be executed on a hand held device.
4. Analysis of the agent. In this section we go into detail on the analysis of the agent. We analyze a simple agent that only translates asynchronous requests into synchronous ones. We also discuss complex agents that can perform more complicated tasks, like supporting write-backs of objects. Dealing with complex agents often requires some modifications to the server. In this section we also present the impact on the server.
5. Analysis of the MOM communication layer. The proxy and agents communicate using a message-oriented middleware (MOM) communication layer. We analyze some details connected to the use of this layer in our system. We analyze how the data are transported from one computer to another, and how the system handles disconnections.
6. Analysis of browser application. Because we create a browser that works disconnected, we briefly discuss how application awareness of disconnections can be supported in our system. We present methods to control the system behavior in a disconnected state. We show that application behavior (for example how long the browser should wait for a response) can be controlled from the client, from the server, or by giving hints to the browser.

6.1 Important issues concerning disconnections

The concept of disconnection is often defined implicitly by the action of removing the network cable from a computer. Working disconnected in this context is identified with working autonomously on a disconnected computer and later synchronizing the changes to other computers on the network, when the network cable is plugged in again.
We see this definition as only a special case of a more general concept of disconnection. We define disconnection as not being able to uphold a network connection at some point in the protocol stack.

This may apply to any protocol layer (data link layer, transport layer, application layer). In reality, in systems that can handle disconnections, there will always be some kind of connection at some level in the system. In some cases it occurs in a subtle way in the application logic, as the ability to synchronize changes to the server. In more straightforward ways, it can be seen as virtual connections (or correlation IDs) in message-oriented middleware systems.

The implication of our definition is that the concept of disconnection is affected by the characteristics of the physical media (cable, radio waves...) and the protocols using it. A disconnection occurs when the physical media is disturbed in such a way that the protocol in use cannot uphold a connection. Notice that another protocol might be able to use the physical media where the protocol in question fails. Disconnections can also occur in case of a server or client breakdown.

The solution to the disconnection problem is to change the protocol to a more resilient one, or to build a protocol layer on top of the failing protocol that handles the disconnections. The new protocol layers often take the form of synchronization mechanisms between remote server data and locally cached server data.

The concept of disconnection also implies that it is a state that lasts for a duration of time. The solution depends on the duration and frequency of disconnections and on the time constraints of the client application. If disconnections are longer than the application logic can tolerate, it is necessary to move server data to the client. This is often the case when using hand held computers with an IR, Bluetooth or serial cable network connection, because these types of network connection result in very extended disconnections (see section 2.2.2).
If the application logic's time constraints are looser than the disconnections are long, the problem can be handled by changing the protocols that the application uses to more resilient ones, or by writing the application using asynchronous primitives. In the latter case the application can take some delay in the responses into account and still perform the intended job. When using a hand held computer with a wireless LAN network connection we can expect frequent and short disconnections (intermittent connectivity), so we will often be in this situation. If support for network breakdown is needed, extended periods of disconnection have to be supported.

As mentioned above, the reason disconnections are a problem at all is that the application has some time constraints on receiving the response to a request. The application might need the result in order to continue. Submitting a log-in form is an example of this. How strict the time constraints are for receiving the result of a request depends on the type of application.

Many requests are not followed by requests that depend upon them. When sending an e-mail we have no time constraints on the action, because no later actions depend on its result. Similarly, filling out a form and submitting it is often a single operation that can be conducted in one request. These types of requests are to some extent time-independent. Thus we can just queue these types of requests while disconnected and forward them when connected again.

Notice that we might be able to structure the hospital work-flow application that we are going to build on top of the general system in a way that only involves isolated requests. This is the case if every action on the same hand held computer is independent of other requests and can be formulated as a form that can be submitted in a single request. Such a request can easily be queued during disconnections. It is clear that there will be requests that depend on a request that submits a log-in form. In addition, requests that fetch the forms have to precede requests executed from the forms. However, these requests might be issued in advance and the forms cached. We will in fact use this structure in our implementation of the hospital work-flow application.

In the following sections, we analyze different approaches found in the literature for handling disconnections. They generally differ in their ability to handle disconnections of different duration.

6.2 Analysis of overall approach

In this section we analyze different approaches for supporting disconnections found in the literature:

1. Changing low level protocol. If a protocol in the protocol stack is not designed for a particular physical media, this might lead to frequent disconnections. We might solve the problem by changing a low level protocol in the protocol stack. We will show that this method is applicable only for short disconnections.
2. Asynchronous client server system.
To deal with longer disconnections, we need an asynchronous communication system. In this analysis, we show how an asynchronous system can be assembled. We propose using a message-oriented middleware system as the communication layer in such a client-server system.
3. Client-agent-server system.

Using asynchronous primitives requires that both the client and the server can use them. When the server cannot be changed, we can use an agent that performs a transformation from asynchronous to synchronous requests. In this section we analyze systems using an agent.
4. General issues when using local server data and logic. The asynchronous client server system cannot be used to support periods of disconnection that are long compared to the application's time constraints. In this case server data and logic may have to be replicated to the hand held computer. In this section, we give an overview of some general problems related to replication.
5. Mobile object model. Using mobile objects is a way to copy a part of the server data and logic to the client. Objects can be downloaded to the hand held computer and accessed during disconnections by the client application. In this section, we present system considerations for mobile objects.
6. Client-proxy-server. Another method is to replicate the complete server, or a lightweight version of it, to the hand held computer. We call this server replica a proxy in this thesis. In this section we analyze the proxy method.
7. Our approach. We want to build system support for disconnection in the internet type client server model. In this section we propose a solution based on a simplified general proxy and an agent that transforms asynchronous requests into synchronous ones.

Changing low level protocol

If disconnections are very short compared to the application's demands, the problem might be solved by changing the protocol stack at a low level. Protocols functioning satisfactorily on a fixed, highly reliable, high bandwidth network might not be usable in a wireless LAN, because wireless LANs are considerably more unreliable and have orders of magnitude lower bandwidth. As an example, the TCP/IP protocol does not always function optimally in a wireless LAN environment.
When packets are lost, TCP will assume that this is because of congestion. In a fixed network the error rate is so low that this is a valid assumption. TCP reacts by reducing the transmission rate. The error rate is considerably higher in a wireless LAN. In a wireless LAN the assumption that a lost packet is the result of congestion is not

valid. In a noisy environment the correct strategy might be to increase the transmission rate in order to get anything through [Tanenbaum 1996]. A considerable amount of work has been done to modify existing protocols (e.g. TCP/IP) to optimize performance on new types of physical media [Balakrishnan 1997]. Another solution is to use specially designed protocols that are optimized for the particular physical media's characteristics.

In our project, we do not focus on the particular protocols that should be used with different types of physical media. Our aim is to make a general system. In our client server support system we have already abstracted the communication protocols used from the client-server layer. In addition, the modifications that we made in chapter 5 to handle the general limitations of hand held computers can easily be extended to handle short disconnections by choosing a communication protocol between the proxy and the agent that is optimal for the particular physical media.

Footnote 1: In our implementation we actually chose to use TCP/IP, but we could have chosen any reliable protocol.

Asynchronous client server system

In case of frequent and short disconnections (intermittent connectivity), the application itself can be programmed in a way that makes it resilient to disconnections. Instead of using a synchronous client server protocol, we can use asynchronous primitives that are independent of time constraints. When using an asynchronous primitive, control is returned to the client immediately after invoking a request. The request message is stored in temporary memory, and a separate process (system support) takes over the responsibility for issuing the request. If the connection is not available, the process waits until it is up again. In this way the application's invocation and the communication of the request are separated (see appendix A.5 for a general treatment of asynchronous systems). In this section, we look at two possible solutions for an asynchronous client server system:

1. A simple background process on the client.
2. MOM model.

Simple background process. In the simplest system, presented in figure 6.1, the system support is just a background process on the client computer that executes the requests synchronously, directly against a remote server, when it is possible to establish a connection.

Figure 6.1: A simple background server. A simple background process (Request manager) issues requests synchronously to the server when the hand held computer is connected. In this case the application is a browser application displaying a form application.

The system support offers asynchronous primitives that access some memory controlled by the system support. This memory might be persistent. If we are not allowed to change the server, the server determines the protocols that can be used. These protocols may be optimized for fixed networks. Because we expect the connection to be characterized by intermittent connectivity, this solution is not suitable. In addition, a new connection will be made for each request. This might quickly exhaust the hand held computer's resources.

Message-oriented middleware model. Instead of letting a background process on the client issue invocations synchronously against the server, we can build the client server system so that both the client and the server use an asynchronous model. We propose building a client-server system on top of an asynchronous message-oriented middleware (MOM) layer. Message-oriented middleware (MOM) [Steinke 95] is software that resides in both portions of a client/server architecture. MOM abstracts communication and is typically placed on top of the network and transport layers (figure 6.2a). MOM allows asynchronous communication between applications. Applications can push messages into and pull messages out of the

MOM. MOM offers temporary storage for messages. This means that applications communicating with each other do not need to establish a direct connection or even be running at the same time. The temporary storage can be persistent, and MOM can generally offer services like delivery guarantees, delivery order, priority options, and message properties like time to live, error handling and encryption. In general, MOM can be used for making reliable distributed systems. When used on the fixed network, a MOM is typically structured as a separate process (with a queue) that communicating applications contact through an interface (see figure 6.2b).

When used with hand held computers, we need a more complex design. Because we want to use the MOM while disconnected, a MOM queue has to be placed on the hand held computer. Because we want the server to be able to contact the MOM even when the hand held computer is disconnected, we also need a MOM queue on the fixed network. In order to support disconnected work, the MOM design should be based on a minimum of two queues that route messages between them: a queue on the fixed network and a queue on the disconnected device (see figure 6.2c).

In appendix A.6 we give a more detailed description of MOM. In this context we only need to focus on how a MOM can be used to support an asynchronous client server system. It should be clear from the discussion above that a client application and a server can communicate using a MOM structured as shown in figure 6.2c. In this case, both client and server contact the MOM when they send and receive messages. We can build a client server support layer that uses the MOM for communication. The responsibility of the client server layer is to contact the MOM, pack and unpack messages, and pair request and response messages. Figure 6.3 shows such a system.
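The two-queue structure and the pairing of request and response messages can be sketched in a few lines of Python. This is an in-memory illustration under simplifying assumptions (no persistence, no delivery guarantees, a boolean standing in for the state of the link); the class and function names are hypothetical and do not describe the MOM used in our implementation.

```python
from collections import deque
import itertools


class MomNode:
    """One side of the MOM: messages waiting for transfer and messages delivered."""

    def __init__(self):
        self.outbox = deque()
        self.inbox = deque()

    def push(self, message):
        self.outbox.append(message)

    def pull(self):
        return self.inbox.popleft() if self.inbox else None


def synchronize(node_a, node_b, connected):
    """Route queued messages between the two queues; do nothing while disconnected."""
    if not connected:
        return 0
    moved = 0
    for source, target in ((node_a, node_b), (node_b, node_a)):
        while source.outbox:
            target.inbox.append(source.outbox.popleft())
            moved += 1
    return moved


class ClientServerLayer:
    """Packs requests into messages and pairs responses with requests by correlation ID."""

    def __init__(self, mom):
        self.mom = mom
        self.pending = {}
        self._ids = itertools.count(1)

    def send_request(self, body):
        correlation_id = next(self._ids)
        self.pending[correlation_id] = body
        self.mom.push({"correlation_id": correlation_id, "body": body})
        return correlation_id  # control returns immediately

    def poll_response(self):
        message = self.mom.pull()
        if message is None:
            return None
        request = self.pending.pop(message["correlation_id"])
        return request, message["body"]
```

The application can issue `send_request` while disconnected; the message simply stays in the queue on the hand held computer until `synchronize` is called with the link up.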
The MOM communication layer can use any protocol to synchronize messages between the part of the MOM on the fixed network and the mobile part. Notice that all client server requests can be multiplexed onto a single connection between the MOM parts. Many of the techniques discussed in 5.3 for handling network limitations of hand held computers are compatible with this solution. In fact, we could have chosen to use any message system that offers asynchronous communication (for example SMS). However, by choosing a MOM system, we get a reliable and general application-to-application communication system that offers a number of services (for details see A.6). The system can be used by applications without going through our internet type client server layer. This is useful if we need to push content to applications on hand held computers.

In the system presented in figure 6.3, the server has to be programmed to use the asynchronous system. The server has to contact the MOM and get request messages. The server uses the application support layer to unpack

request messages. However, it is often not possible to change servers to use an asynchronous model. If we do not have control over the implementation of servers (e.g. web servers) we might have to make synchronous invocations.

Figure 6.2: Message-oriented middleware system. a) MOM abstracts communication and is typically placed on top of the network and transport layers. b) Applications communicate by pushing messages to and pulling messages from the MOM queue. c) In a system involving a hand held computer that is periodically disconnected, we need a MOM queue on both the hand held computer and the fixed network side.

Client-agent-server system

A number of systems in the literature use agents on the fixed network to execute requests synchronously against a server on behalf of a mobile client ([Oracle 2001a], [Siegel 1999], [Housel 1996], [Joseph 1997]). We can expand the system we proposed in the last section with an agent that pulls requests out of the MOM and executes synchronous requests against the server.

After receiving the response, the agent pushes the response message into the MOM. Because the agent translates asynchronous messages into synchronous, protocol-specific requests, we chose to make the agent protocol specific.

Figure 6.3: Asynchronous client server system based on message-oriented middleware. The techniques discussed in section 5.3 can be implemented as a part of the MOM transport.

In figure 6.4 we present a general model of a client-agent-server system. The application support on the client transforms the requests into messages. The messages are sent asynchronously to an agent that unpacks the messages and executes the requests synchronously against the server. The response is packed into a message and sent back to the MOM on the client. The application support layer unpacks the response. Finally, the application support layer sends the response back to the application. This design has several advantages:

1. It handles network limitations. The protocol used to send messages in the MOM system between the hand held computer and the agent can be chosen to be optimal for the physical media. Multiple requests can be multiplexed onto the same link. If the communication link is variable and unreliable, the communication layer can hide it, and the application and server do not have to

be directly affected. The general design presented in section 5.3 for handling the general network limitations of hand held computers is thus integrated in this solution. If persistent memory is used to hold the messages on both the client and the server side, the message passing can be made reliable.
2. It masks disconnections. The agent can execute requests against the server even though the hand held computer is disconnected. The application can issue requests even though it is disconnected.
3. The application can temporarily close down. After issuing the request to the support layer the application can close down (see footnote 2). When starting up again, it can re-establish the previously intercepted connection and check whether the response has arrived.
4. The server does not need to be changed. The agent transforms requests between the asynchronous and the synchronous model. The agent executes the request synchronously against the server on behalf of the client. The agent also translates responses from the server to the asynchronous model. This means that the client can execute requests against, e.g., servers on the web that expect synchronous requests.
5. The responsibilities of modules are clearly separated.

Figure 6.4: Client-agent-server design. The agent executes requests synchronously against the server on behalf of the client.

Footnote 2: This requires that the modules are implemented in separate processes or threads.
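The role of the agent in this design can be sketched as a loop that drains queued request messages, calls the unchanged server synchronously, and queues the responses. The sketch below is a simplified illustration: the queues are plain in-memory deques, `call_server` is a hypothetical stand-in for a blocking protocol-specific call (for example an HTTP request), and the message fields are assumptions.

```python
from collections import deque


def run_agent(request_queue, response_queue, call_server):
    """Translate queued asynchronous request messages into synchronous server calls.

    `call_server` is a hypothetical blocking function (request body -> response body).
    Each response message carries the correlation ID of its request so that the
    client side can pair them up later. Returns the number of requests handled.
    """
    handled = 0
    while request_queue:
        message = request_queue.popleft()
        try:
            body = call_server(message["body"])  # synchronous call on behalf of the client
            status = "ok"
        except Exception as error:
            # Report the failure as a message instead of losing the request.
            body, status = str(error), "error"
        response_queue.append({
            "correlation_id": message["correlation_id"],
            "status": status,
            "body": body,
        })
        handled += 1
    return handled


def fake_server(body):
    """Stand-in for an unchanged synchronous server."""
    if body == "GET /missing":
        raise LookupError("not found")
    return "<form id='1'/>"


requests = deque([
    {"correlation_id": 1, "body": "GET /forms/1"},
    {"correlation_id": 2, "body": "GET /missing"},
])
responses = deque()
handled = run_agent(requests, responses, fake_server)
```

The agent runs on the fixed network, so it can complete the calls while the hand held computer is disconnected; the responses wait in the queue until the link is available again.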

The disadvantage of this system is that extended periods of disconnection can generally not be supported. If the application's time limits on receiving the response are tighter than the duration of the disconnections, this model is not adequate.

General issues when using local server data and logic

If the hand held computer is disconnected for extended periods of time, asynchronous programming is not enough. Time constraints in the application logic might not be met. Even if disconnections are of short duration, they might be unacceptable for some applications. One method for solving the problem is to replicate server data and logic to the client. In this case, we can make synchronous invocations locally. The cost is synchronization problems with the central server. If we change local server data, we might have to synchronize the changes to a central server in a lazy fashion (when reconnected). If many clients access and change the same data concurrently, we can face consistency problems. The server data cached on a client might become outdated (cache coherency). In addition, concurrent updates might conflict when written back to the server. A pessimistic lock-based strategy can be used to avoid conflicts. However, if disconnections are extensive and the granularity of server data is coarse, this solution is unacceptable [Kistler 1992]. There are many different optimistic schemes for shared access to replicated data. For example, data consistency can be ensured by server call-backs, but this approach is not acceptable when clients are characterized by frequent disconnections. Another approach is to operate with multiple copies on the server (Lotus Notes). Yet another approach is to check on demand. We believe that if disconnections cannot be avoided, any scheme should be client initiated.
Data that is only read can be updated on demand (when used), but if we do not know when we will be disconnected, the best approach is to hoard updated server data by periodic calls to the server (when connected). For data that is written, we think the best approach is to ensure data consistency when replicated data is written back to the server. Any conflicts have to be solved in an application-dependent way.

Mobile object model

Mobile objects is a model that allows server data and logic to be downloaded to clients at a fine granularity. Changes made on locally downloaded objects can be synchronized to the master copies of the objects on the server by writing back the complete objects or by synchronizing with change logs. Rover is an example of a system that uses mobile objects to support disconnected

operation ([Joseph 1996], [Joseph 1997]). A general model of mobile objects is presented in figure 6.5.

Figure 6.5: Local objects model with lazy synchronization with server objects. The synchronization messages sent between the client and the server should be sent using a communication layer that multiplexes all messages onto a single connection.

The model has two levels of abstraction: the transport of objects and the invocation of objects. The transport of objects can be done by a simple client-server protocol where the server is a simple data (object) server. The invocation of objects is more complex. If a client copy of an object is invoked, every change to the object should be synchronized to the server copy. If the object is under the control of a support layer, any invocations of the object can be intercepted and a change log can be built for synchronization with the server copy. Rover uses this technique [Joseph 1997]. Alternatively, we can let the objects themselves have the responsibility for synchronizing with the server copy. We will return to this solution in section 6.3.3, where we will show that we can build a simple version of the mobile object model on top of the internet-type client-server model. However, the mobile object model is more elegant when used with an object oriented client-server model (like RMI).

Client-proxy-server

Many servers are not structured as a collection of objects but are split into server logic and server data (in a database). In this case, we can replicate the server to the hand held computer. In this thesis, we call this type of replicated server a proxy. We might replicate a complete server or only a lightweight version with only a subset of the logic and data. A number of systems use

this approach ([Kistler 1992], [Douglas 1995], [Oracle 2001b]). In some systems all replicas of the server are (almost) equal ([Douglas 1995]). For simplicity and efficiency we will assume that a single server on the fixed network functions as a master and that changes on all other replicas are only tentative ([Gray 1996]). The proxy and the central server synchronize changes.

In principle, the application can be kept totally unaware of disconnections. The application makes synchronous invocations. The invocations are intercepted by the proxy, which services the requests and synchronizes with a central server lazily. Typically, the proxy and the server use logs of changes as a part of the synchronization algorithm. Synchronization may be done simply by replaying a proxy log on the central server. If we use an optimistic strategy for accessing shared data, changes in one log may conflict with changes in another log. In this case, the conflicts must be resolved. Some systems use a general conflict resolution scheme, others merge the change logs and use application specific conflict-resolution algorithms ([Douglas 1995]). If systems support disconnected operation for extended periods of time, log reducing algorithms can be used.

The server might be a simple data server. The offered interface may only include read and write operations. File servers like CODA ([Kistler 1992]) and database servers like Bayou ([Douglas 1995]) and Oracle Lite ([Oracle 2001b]) are examples of such systems. The server can also be a more complex server with considerable logic needed to operate on data (for example an e-commerce web-server).

Notice that complex applications can often be built on top of simple data servers. A distributed calendar application can be constructed from the following modules:

1. A central server application with access to a central calendar DB server.
2.
Client applications on mobile nodes with access to local replicas of the calendar DB server.
3. A synchronization mechanism between the DB servers.

In this case the state of the central server is completely reflected in the data in the DB, and synchronization can be done between the databases. In cases where the state of the central server is not completely reflected in the DB data, there has to be a synchronization between the local proxy server and the central server (assuming that a proxy server model is used and the synchronization is not done from the application to the server).

If only a part of the data and logic is replicated, there should be a mechanism to handle requests that cannot be serviced locally (for simplicity we will assume that we are using a simple data server). One strategy is to raise

an exception. Another is to block and forward the request to the central server. In case of a read request, we can choose to cache the data locally. In case of a write request, we can choose to retrieve the changed data and cache it locally. In either case, we might have to evict some other data on the proxy in order to reduce the resource consumption. To avoid data misses, we can use advanced algorithms that estimate future use patterns and prefetch the data ([Kistler 1992], [Kaashoek 1995]). The granularity of data is an important parameter in systems where only part of the data is replicated to the proxy. In figure 6.6 we show a general model for a client-proxy-server system.

Figure 6.6: Client-proxy-server model. The proxy server and the central server synchronize changes. The application issues synchronous requests that are intercepted by the proxy. Change logs are usually used for synchronization between the proxy and the central server.

The model is more general than the client-agent-server model discussed earlier. It offers support for extended periods of disconnection. However, the client-proxy-server model assumes that the proxy and the server know how to synchronize changes, and it takes up more resources on the hand held computer than the client-agent-server model. If disconnections are short, the client-agent-server model might be adequate.

Our approach

Our aim in this project is to develop a general client-server support system. Earlier we proposed a client-agent-server design based on a MOM communication layer. With this design we could support an asynchronous

internet type client-server protocol. In the previous two sections we presented two traditional approaches that are used in the literature to support extended periods of disconnected operation. In this section we will develop our approach.

Complex servers and object servers

We see the internet type client-server model as a general client-server model. The model is generally identified with the http-protocol. The http-protocol in turn is often identified with its use in web-browsers. However, applications other than browsers can use the http-protocol. We will divide servers using the http-protocol into two types:

1. Complex servers
2. Object servers

In complex servers the server logic and data are extensive. When using a hand held computer, the requests have to be sent to the server. Replicating server logic and data to the hand held computer in order to support extended periods of disconnection is not possible. In this case the asynchronous internet type client-server design we proposed earlier is the best support for disconnections we can offer. Complex servers and the thin client model are compatible.

In object servers an application downloads objects from the server. The GET method can be used for this. The object can be a picture, a data object or an application. The internet type client-server protocol is only used as a transport protocol. By using the PUT method an application can write objects back to the server at the location specified by a URL (the PUT method is seldom supported in today's servers). The objects can contain complex server logic and data that come under the control of the application.

Often the server is a combination of the two types: a web-server typically lets a client download an object that is actually a form-application. Complex actions on the server can then be initiated from this application by clicking on hyper-links or submit-buttons.
There is a parallel between data servers like Bayou and Coda and an internet server that acts as an object server. In both cases the server interface is very simple. In order to support extended periods of disconnection with this type of server we can choose a similar approach: to replicate the object server to the hand held computer. The server logic in this case could be very simple and general. The proxy does not need to know the application

semantics of the objects it is controlling. It just needs to offer read and write access to the objects. If it does not have the objects locally, it can forward the request to the central server. The server objects come under the control of the application and are changed locally. In this way the model is also a mobile object model.

Figure 6.7: Our client-proxy-agent-server model. A simple and general internet type proxy is used in our system. The proxy can service requests locally from a cache or forward requests to an agent. When disconnected, the requests are queued. The agent executes the requests against the server. Applications can also operate for extended periods of disconnection by issuing requests that return server objects. On the client side the objects are either constructed by a content handler in the application support layer, or the application constructs the object itself from a stream. As we shall show later, synchronization of objects can be supported in the system. In this case changes made on objects can be synchronized lazily with the master copy on the central server. The proxy does not understand content and only caches the data encoding the object - not the object itself.

Our model

We propose to support a general client-server model that supports both complex servers and object servers (see figure 6.7). We propose using a

simple and general internet type proxy with access to a cache. When used with object servers it can function as a replica of the object server. When used with complex servers it in general just forwards requests. Some complex servers might be supported by plugging server logic modules into the proxy, but in this thesis we will not examine that possibility. By offering an asynchronous interface the two models can coexist. Applications can also operate for extended periods of disconnection by issuing requests that return server objects and using these while disconnected. Either the application constructs the objects from byte-streams, or a content handler in the application support layer constructs them. The objects in our general client-server system can be encoded in many different ways (for example as serialized JAVA objects or by using XML).

On the fixed network side we use an agent to execute requests synchronously against servers. The agent acts on behalf of the hand held computer on the fixed network. In the simplest case the agent just transforms asynchronous requests into synchronous ones. However, the agent can be extended to execute more complex tasks. The agent could for example be used to translate from the internet-type client-server protocol to another client-server protocol, or it could coordinate the execution of a single request against many servers. The basic concept is that the complexity in the system should be pushed from the hand held computer to the agent (and the server).

Between the proxy and the agent we use a MOM communication layer. The MOM-layer multiplexes multiple requests onto a single connection and can use a transport protocol optimal for the physical media. The MOM communication layer exchanges messages in the background.
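The client-side behaviour of this model (serve from the cache when possible, otherwise queue the request until a connection exists) could be sketched roughly as follows. This is an illustration under simplifying assumptions, not the prototype itself: `Proxy`, `send_to_agent`, `on_response` and `set_connected` are hypothetical names, and in the real design the queueing would be handled by the MOM communication layer on persistent storage rather than by an in-memory deque.

```python
from collections import deque


class Proxy:
    """Client-side proxy: answer from the cache or queue the request for the agent."""

    def __init__(self, send_to_agent):
        self.send_to_agent = send_to_agent  # hypothetical hook into the MOM layer
        self.cache = {}                     # request key -> bytes encoding the object
        self.pending = deque()              # requests waiting for connectivity
        self.connected = False

    def request(self, key):
        """Return cached bytes if present; otherwise queue the request and return None."""
        if key in self.cache:
            return self.cache[key]
        self.pending.append(key)
        self.flush()
        return None

    def flush(self):
        """Forward queued requests in order while a connection is available."""
        while self.connected and self.pending:
            self.send_to_agent(self.pending.popleft())

    def on_response(self, key, data):
        """Store the agent's response so later requests can be served locally."""
        self.cache[key] = data

    def set_connected(self, connected):
        self.connected = connected
        self.flush()
```

Note that the proxy stores only the bytes that encode an object and never interprets them, in line with the requirement that it contains no server application logic.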
We have earlier chosen (section 5.1) to build a browser application on top of the model we show here in order to support a thin client model and complex servers. In figure 6.8 we show a model of our system that demonstrates how a thin client system can be built on top of our general client-server system. In this case the application is a browser that executes form-applications that are fetched using the general client-server system. The browser application itself could in principle be downloaded as an executable object using the client-server system.

The browser solution just described closely resembles the WebExpress system after the latter was extended with asynchronous requests and a cache to support off-line browsing ([Chang 1998]). However, WebExpress focuses on offering optimized browsing in a wireless environment, whereas our aim is to build a general, reliable client-server system for hand held computers that supports disconnections. As a consequence we will build up our system in a more flexible way and also analyze how write-back (and synchronization) of changed objects can be supported. Applications that need server logic and data in order to function during extended periods of disconnection can use this facility

to access server objects (see figure 6.7). However, it should be noted that we think there are many unsolved technical problems in this kind of system. Projects like Bayou ([Douglas 1995]) have shown that replicating and sharing data in a disconnected environment leads to update conflicts that can only be reconciled using complex application specific schemes. The practical usefulness of such complex systems is small in our opinion. As a consequence we will only analyze support for write-back for completeness. Our focus will be on the thin client system shown in figure 6.8, which does not involve write-backs. In addition we will incorporate reliability measures in the system in chapter 7. Thus, to some extent our system can be seen as a synthesis of WebExpress and Rover [Joseph 1996], dedicated to hand held computers.

It is our aim to keep the proxy as simple as possible and yet as general as possible. No server application logic should be needed. The complexity should be placed in the agent and the server.

Figure 6.8: Client-proxy-agent-server model with thin client applications. This figure demonstrates a potential use of our general client-server system. The application can be a browser that executes form-applications that are fetched using the general client-server system. The browser application itself could in principle be downloaded as an executable object using the client-server system.

Summary - Overall approach

We have shown that short disconnections can be handled by replacing low level protocols. Longer disconnections can be handled by introducing asynchronous communication. Because most servers cannot be changed to use asynchronous communication, we introduce an agent that has the responsibility for translating asynchronous requests into synchronous ones. The asynchronous communication can support intermittent connectivity. To support longer disconnections, we need a proxy that can use a cache to service client requests.

We proposed a simple and general client-server support system that includes an application, a proxy, an agent, and a server. Proxy and agent communicate using a MOM communication layer. The client-server support system can be used to fetch objects into the application that can support extended periods of disconnection. We showed how a browser application could be built on top of the client-server system. The browser application could request form-application objects that can be used to issue requests to servers that result in complex actions. The browser application could itself be fetched by using the underlying client-server system.

6.3 Analyzing the proxy

After having chosen the overall design, we turn to analyzing the details of the solution. We start with the proxy. The proxy receives requests from the application (through the application support layer) and uses the message-oriented middleware to communicate with the agent. The complexity of the proxy is determined by its responsibilities. We try to keep the proxy as simple and general as possible in order to minimize the resource consumption on the hand held computer. The proxy which we have chosen is a general internet type proxy that contains no server application logic. The proxy simply controls a cache. If the proxy cannot service the request, it forwards the request to the remote server.
However, a number of problems have to be addressed:

1. Support for complex servers. In contrast to a simple data server, a complex server is a server that performs additional computations (that may have side effects) while servicing a request. In this section, we will analyze the requirements for a proxy interacting with such a server.
2. Support for simple object servers.

Support for the PUT and GET methods in object servers. Support for object synchronization is discussed.
3. Requirements for downloaded objects.
4. Transformation between request/response and messages. We have to pair request and response messages with the correct request connection.
5. Encoding of requests in messages.
6. Cache coherency.
7. Cache management.
8. Handling outdated requests.
9. Finding the agent. To what extent should the proxy understand the request?
10. Asynchronous interface to the application. Should polling or callback primitives be used?
11. One-way requests.
12. MOM primitives. What MOM primitives should be used?

Support for complex servers

The server is often a complex server. When a request is received, the server often performs complex actions. These actions demand complex server logic that is not present in the proxy, so they have to be performed by the remote server. Consequently, the proxy just queues the request for remote execution. In some cases the request can be serviced locally, however.

In the internet type client-server protocol, the actions are defined by parameterization of the request (see section 4.4.2). The request is defined by the request method and a parameterized URL (in the POST method we can also attach data). An example of a typical request to a complex server is the request that is sent as the result of submitting data from a form-application in a browser. The user has filled out and submitted a form. The browser collects the input data in parameters and places them in the URL. In case of the http-protocol the POST method is generally used, but GET (and PUT) could in principle also be used (POST also allows data to be submitted as an attachment). It is the way the server interprets the request that is important. The server can

now perform complex actions and generally return an object as a result of the actions (e.g. a web-page). In some cases, it makes sense to cache the returned object. For example, if the form was a search-form and the returned object a web-page with hits, we could probably reuse the returned object the next time we submit the same request. In this case, we do not need to contact the server. In other cases, it does not make sense to cache the returned object. This is the case if we cannot expect to get the same result when we submit the same request again.

In addition, some requests have to be executed on the server. Suppose we submit a request that results in a server action that updates a counter and returns an object (a web-page) containing the value of the counter. If we issue the request in order to update the server counter, we have to forward the request from the proxy to the server. Serving the request from the cache would be an error. If we have issued the request to get the counter value, the cached object might be adequate, though.

The application has the knowledge about whether a cached result would be adequate or whether the request should be forwarded to the server. As a consequence, we have decided to allow the application to send a parameter to the proxy indicating whether a given request can be serviced from the cache or should be forwarded to the server. We also let the application send a parameter to the proxy that decides whether or not the result of a request should be cached.

In order to find entries for cached objects we identify objects with the request that they were a result of. We use a key that is a hash of request-method + url (with parameters) + extra data (like POST-data). An example could be GET

Notice that when complex servers are involved the cache can only be used for reads, because the general proxy does not know any semantics of the server.
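The cache key and the two application-supplied parameters described above might look as follows in a small sketch. Several details are assumptions made only for illustration: the text does not name a hash function (SHA-256 is used here), the parameter names `allow_cached` and `store_result` are hypothetical, and the call is shown synchronously for brevity although the proxy interface in this design is asynchronous.

```python
import hashlib


def cache_key(method: str, url: str, extra: bytes = b"") -> str:
    """Hash of request method + URL (with parameters) + extra data such as POST data."""
    digest = hashlib.sha256()
    for part in (method.upper().encode(), url.encode(), extra):
        # A length prefix keeps the boundaries between the three fields unambiguous.
        digest.update(len(part).to_bytes(4, "big"))
        digest.update(part)
    return digest.hexdigest()


class RequestCache:
    def __init__(self, forward):
        self.forward = forward  # hypothetical callable that executes the request remotely
        self.entries = {}

    def execute(self, method, url, extra=b"", allow_cached=True, store_result=True):
        key = cache_key(method, url, extra)
        if allow_cached and key in self.entries:
            return self.entries[key]      # the application accepts a cached result
        result = self.forward(method, url, extra)
        if store_result:
            self.entries[key] = result    # the application asked for the result to be cached
        return result
```

In the counter example, a request issued to increment the counter would be sent with `allow_cached=False`, while a request that only needs to read a recent value can leave the default and be answered from the cache.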
As mentioned before, one approach would be to allow server plug-ins in the proxy. We will not examine this solution in our thesis. We have chosen to analyze support for object download instead. By operating on downloaded server objects the application can potentially handle long periods of disconnection.

Support for simple object servers

With small extensions, we can make our general proxy more powerful. If the server is only an object server (data server), the request interface can be very simple.

Taking the http-protocol as an example, we can use the GET method to read objects (addressed by URL) and the PUT (or POST) method to write objects back to servers.

A GET-request can be serviced locally if the object is in the cache. If the objects in the cache are identified by their URL, supporting GET-requests should not present any problems. We will discuss how to assure cache coherency later (section 6.3.6).

A PUT-request can be supported under certain circumstances. The problem is conflicting write-backs due to concurrent access by multiple clients. In our model, the PUT request is issued by the application. If the application has read an object with the GET-method and changed it, it might want to write the changed object back to the server. Synchronization can generally be done by sending the complete object or by sending a log describing the changes. If the object is big, a change log is a good solution. If the object is small and the number of changes is big, sending the object is generally preferable.

Whether to send the updated object or a change log also depends on other aspects. If we allow multiple clients to access the same server data concurrently, they may make concurrent updates to the object. We could avoid the problem by using a pessimistic lock-based approach. However, if disconnections are extensive and the granularity of server data is big, this solution is unacceptable [Satyanarayanan 1993]. Some systems signal clients with call-backs when a cached object is accessed. In this way, conflicting updates can also be avoided. However, call-backs are not a good solution when clients are frequently disconnected. We believe that any scheme that involves disconnections should be client initiated.

We could also allow a semantics where the last client that writes to a shared variable overwrites the other updates. In this case, we need to transfer the complete object (otherwise we might mix updates).
We find this method a bit brutal and will not use it in our design. Another approach is to operate with multiple copies on the server if clients use shared data concurrently. In this case, we can synchronize with both a change log and the complete updated object. We do not think this technique solves the problem.

If we use an optimistic approach for data access and build a log during disconnection, we can synchronize by sending the change log and trying to merge the change logs from the clients that accessed the objects concurrently. Some updates might be commutative [Gray 1996]. If updates conflict, we can try to resolve the conflict in a standard way or by using application specific semantics [Douglas 1995].

It should be clear from the discussion above that the PUT method should contain the application's change log for the object. This implies that the server should be able to understand the change log. There is a problem with the proxy in this case. Because the proxy in our model is only a simple

object server that does not contain any logic for executing a change log on an object, the PUT method should contain the complete object in binary form so that the proxy can update the cache. This means that the application has to include both the object itself and its change log when it uses the PUT-method. The proxy can intercept the PUT-method and extract the object. By using the URL, the proxy can identify the correct entry in the cache. Then the proxy can forward the PUT request, containing at least the change log, to the server.

Notice that the copy of the object in the proxy cache is always only tentative. Even if the proxy has just fetched a copy, a PUT-request to the server from another client may immediately make the copy invalid. We will return to this problem when discussing cache coherency (section 6.3.6). In this context, we will only note that when we issue a PUT request and write the updated value in the cache, this value may be invalid due to possible conflicts with other clients' change logs. We can reduce the risk of invalid caches by returning the updated object to the proxy after a PUT-request. If the tentative update written in the cache was invalid, the error can be corrected. We cannot wait to update the proxy cache until the PUT-request returns, because the client might disconnect immediately after issuing the request, leaving the application copy and the cache copy inconsistent. If the application is closed down and started again, the object it can fetch from the proxy will not be correct (see footnote 6).

Because the object might change after being submitted to the remote server, the asynchronous PUT request could offer 4 options on how to return from the proxy:

1. The PUT request could return immediately without any return value.
2. The PUT request could return immediately, but the proxy could offer a test primitive to the application that signals when the PUT-response has returned.
3.
The PUT could block until the response from the server returned.
4. The PUT could register a call-back in the proxy.

We think the first option is adequate. The GET method could be used to test periodically whether the value of the object has changed. The second solution would be of little value. The third solution would not be good if disconnections are frequent. The fourth solution is more complex to implement, and if the application closes down for a while the proxy might have problems [Footnote 6: The cache is kept on persistent storage while application data is generally kept on volatile storage. The cache plays a part in the fault tolerance of the system.]

delivering the call-back. Instead, an optimistic approach must be used in application programming.

Notice that the server should be able to understand the semantics of the PUT-request we have defined above. If we can live with a semantics where the last PUT-request wins, the system can be made considerably simpler. In this case, we just include the object in binary form in the PUT. The GET-method can generally be supported without any control over the server.

Different addressing in cache

We are now left with a problem of different addressing. When dealing with complex servers, we address the objects in the cache by the requests that resulted in their retrieval. When dealing with simple object servers, we address objects by their URL. If we choose to address objects by the requests that resulted in their retrieval, we can access objects obtained by GET requests. However, this solution is not satisfactory. The basic problem is how the proxy can distinguish between PUT requests made for object servers and PUT requests made for complex servers, since the proxy actions differ in the two cases.

One solution is to say that PUT-methods are only allowed when object servers are involved. The proxy will know that the object was originally retrieved with the GET-method, and the object can consequently be accessed in the cache by the key hash(get url). The PUT-method is actually seldom supported on web-servers (sometimes for security reasons). Another approach is to send a parameter from the application to the proxy specifying whether the PUT-request should be understood as a request to a complex server or to a data server. Finally, we can choose to make a special PUT-method for data servers (for example PUT-DATA).

When we want to synchronize changes on objects, we can generally assume that the server will comply with any interface we choose. The internet-type client-server protocol is not limited to the http-protocol.
Because the http-protocol is so widely used, it would be an advantage if the discussed techniques could be handled within the semantics of this protocol. We choose to address all objects in the cache by the request that retrieved them. We also choose to let the application send a parameter to the proxy that specifies whether the server should be treated as an object server or a complex server.
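A rough sketch of how a proxy could act on these two choices is given below. It is only an illustration under stated assumptions: `ObjectProxy`, `get_key`, the `object_server` flag and the dictionary layout of the forwarded message are hypothetical, SHA-256 stands in for the unspecified hash, and the example URLs used with it are invented.

```python
import hashlib


def get_key(url: str) -> str:
    """Cache key of the GET request that originally retrieved the object."""
    return hashlib.sha256(("GET " + url).encode()).hexdigest()


class ObjectProxy:
    def __init__(self, forward):
        self.forward = forward  # hypothetical hook that queues a message for the agent
        self.cache = {}

    def put(self, url, obj_bytes, change_log, object_server=True):
        if object_server:
            # Tentative update: a conflicting change log from another client
            # may later invalidate this cached copy.
            self.cache[get_key(url)] = obj_bytes
            self.forward({"method": "PUT", "url": url, "change_log": change_log})
        else:
            # Complex server: the proxy knows no semantics, so it only forwards.
            self.forward({"method": "PUT", "url": url, "body": obj_bytes})
        # Returns immediately without a value (the first of the four options).

    def get(self, url):
        return self.cache.get(get_key(url))
```

The application passes both the complete object and its change log; the proxy keeps the former under the key of the original GET request and forwards the latter, so that a restarted application fetches the updated object from the persistent cache.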

Requirements for downloaded objects

In the scheme presented above the application must build up the change log and synchronize changes made on the object to the central server using a PUT-request. The server has the responsibility to replay the change log on the master copy of the object and reconcile any conflicts. In order to minimize the work needed by the application, we can let the objects synchronize themselves. One solution would be to encapsulate the change log in the objects and let each public method write to the log when it is invoked (including all parameters in serialized form). There might be some restrictions on what kind of parameters can be used and what kind of invocations can be made on the objects. The objects could issue synchronizing PUT-requests each time a public method is used or when the application calls a special commit-method on the object. The server should construct the objects so that they meet the interface described above. The objects could address the proxy by using static addressing or by receiving the information from the application (for example in the Content Handler in the client-server protocol interface lib).

Transformation between request/response and messages

When the proxy receives a request from an application, it has to be able to identify the request and the application issuing the request. The proxy needs this information for later delivery of the response. If the request is made through a loop-back TCP/IP connection, the information is kept in the connection. In this case the connection has to be kept open until the response returns. If other communication methods are used, the information must be passed differently - for example as parameters. We want the application to be able to close down after issuing the request and to fetch the response when it restarts. This feature is especially useful for supporting fault tolerance (see chapter 7).
In this case, the information identifying the application and the request must be passed as parameters even if a TCP/IP connection is used (the application could be identified from port numbers if different port numbers were statically assigned to different applications; we would, however, still have to identify the correct request). In this analysis, we will assume that the information identifying the application and the request is communicated when the request is issued. Of course, this makes some demands on the application support lib that the application uses. By keeping this information in a table (for the duration of the request), we have a method for linking the incoming and outgoing MOM-messages with requests. In addition to a unique ID

identifying the message, each MOM-message usually also has a correlation ID for identifying messages that are connected (for example for sessions). By placing the request ID in this field, response messages returning from the agent can be paired with requests.

Encoding of request

The requests and responses are packed into MOM-messages. It can be discussed whether or not the requests and responses should be compressed. If we want to use differencing techniques that are based on application knowledge, these have to be implemented at this level. Because we have chosen a simple general proxy that does not understand the objects it caches, this type of differencing is not possible. We could implement differencing on a binary level [Coppieters 1995]. The objects in the cache could be used as base objects. On the other side of the connection we could use a process responsible for synchronizing objects. Note the similarity to the agent design. After conducting short tests, we have concluded that binary differencing is too processor demanding to be realistic on hand held computers. Other compression techniques (for example zip) can easily be implemented in the communication layer (MOM-layer). There is consequently no reason for compression algorithms in the proxy. We chose to write the request from the application as text in the MOM-message.

Cache coherency

As mentioned before, server data cached on the proxy can become outdated. If the data is an object encoding software (for example a form-application), new versions of the software might be placed on the server. This will normally not happen often. If the data are objects that are accessed and changed concurrently by many clients, the cache on the proxy might become out of date. One solution is to let the proxy contact the server periodically to check for changes [Kaashoek 1995][Housel 1996].
The coherency interval could be a fixed parameter in the proxy but could also be specified for each object by the server or the application. In case of a fixed interval, a single coherency check could be issued for a number of cached objects owned by the same server. A simpler solution would be to set a time-to-live on each object. When the time-to-live expires, the cache entry can either be deleted or refreshed.

This could be done in a background process in the proxy [footnote 8]. Alternatively, the procedure could be conducted when the application issues a request. We could also piggyback notification of object changes when clients make requests to the server. This implies that the server keeps track of which clients have copied which objects. Alternatively, the server could notify the clients every time an item is updated on the server (call-back). Because clients can be disconnected for long periods of time, many call-back messages might queue up for a client. Generally, it would be better if the update was client initiated and if the server would not have to administer anything. If the server is a web-server out of our control, we can not expect the server to participate in cache updates. As a consequence, we think cache coherency should be transparent to the server. We chose to support periodic cache updates based on time-to-live (TTL). The TTL shows how long an entry is valid. When the TTL has passed, the entry is either deleted or refreshed. Checking the TTL and refreshing can be done by a separate thread or when the cache is accessed during requests. We will return to this subject later (see section 6.3.7).

Generally, we want to avoid any unnecessary logic on the proxy. As a consequence, we do not put the logic necessary to construct requests on the proxy (this logic is placed in an application lib). Instead, we save the request that resulted in the cached object. By replaying the request, the cache entry can be updated.

Management of cache

Hand held computers have limited storage resources. Therefore, it is necessary to manage the cache and prevent it from exhausting system resources. Generally, we prefer to avoid cache misses when we are disconnected, because a cache miss can only be handled by queueing the request for remote service [footnote 9]. When we have limited storage space we must optimize the content of the cache.
However, the hand held computer has limited processor resources. Consequently, algorithms for cache management should be simple.

Control of cache size

The storage resource consumption can be controlled by statically limiting the cache to a certain size. Alternatively, the cache size can be made dynamically dependent on available memory.

Footnote 8: Later we will present how to manage outdated cache entries.
Footnote 9: Some systems treat a cache miss as an error and just raise an exception [Kistler 1992].

For simplicity we chose to statically assign

a maximum size of the cache. The cache is not the only user of memory in our system, and in order to be sure that adequate memory is available for each module we chose a simple static administration of memory.

Control of content

Objects will have different importance for the application. Some objects are so important that the application will always want them in the cache [footnote 10]. Other objects should never be cached because the objects only make sense in one request. This is often the case if the server does not act as a simple object server but conducts more complex actions on server data. We have already described that the application can send three parameters to the proxy that determine proxy actions. Recall that one of these was a parameter that determined whether or not the result of a request should be saved in the cache. This parameter gives some control over the content of the cache. However, we will extend this parameter set with a parameter that determines the priority of the object. The highest priority should indicate that the object should always be in the cache and be refreshed every time the TTL expires. The priority could be used to evict old entries when the cache is full. The priority could also be used to upgrade entries if they are often used. In this way hotspots can be dynamically upgraded in a simple way. An entry should not be able to reach the highest priority this way. We think that this decision should always be controlled by the application. The priority scheme could be combined with a FIFO strategy. In case of eviction, the search strategy could be to find the oldest entry in the cache that has an overdue TTL and a priority lower than that of the request [footnote 11]. Some systems use complex algorithms to predict user behavior in order to determine what entries should be kept in the cache, and in some cases what entries should be prefetched [Kistler 1992]. We think that these types of algorithms are too heavy for hand held computers.
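The eviction rule described above (oldest entry with an overdue TTL and a lower priority than the incoming response) can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the prototype: the names ProxyCache, Entry and PINNED are hypothetical, and the cache is bounded by a number of entries rather than by bytes to keep the example short.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional

PINNED = 3  # hypothetical highest priority: such entries are never evicted


@dataclass
class Entry:
    data: bytes
    priority: int
    expires_at: float   # absolute time at which the TTL is overdue
    inserted_at: float  # used for the FIFO ordering


class ProxyCache:
    """Statically sized cache with priority + TTL + FIFO eviction."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self.entries: Dict[str, Entry] = {}

    def _victim(self, priority: int, now: float) -> Optional[str]:
        # Oldest entry whose TTL is overdue and whose priority is lower
        # than that of the incoming response; pinned entries are skipped.
        candidates = [
            (e.inserted_at, key)
            for key, e in self.entries.items()
            if e.expires_at <= now and e.priority < priority and e.priority < PINNED
        ]
        return min(candidates)[1] if candidates else None

    def put(self, key: str, data: bytes, priority: int, ttl: float,
            now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        if key not in self.entries and len(self.entries) >= self.max_entries:
            victim = self._victim(priority, now)
            if victim is None:
                return False  # nothing may be evicted; response is not cached
            del self.entries[victim]
        self.entries[key] = Entry(data, priority, now + ttl, now)
        return True
```

In this sketch a response that cannot displace any entry is simply not cached, which keeps the algorithm cheap enough for a hand held computer; a real proxy could choose a different fallback.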
We will rely on the methods described above, which give the application control over what is kept in the cache. Because some objects may be very important to the application, we have considered support for hoarding of application-specified objects. Vital objects could be named in a database and hoarded when the proxy starts up. This implies that the proxy should contain the logic necessary to construct requests. We have decided against this solution.

Footnote 10: The application might also demand that the objects are always updated as discussed in section
Footnote 11: Methods for cleaning the cache will be presented later in Managing outdated cache entries (section 6.3.7).

The application can easily

make the necessary request itself at startup. If the objects are form-based applications, they usually also need some user actions.

Managing outdated cache entries

Some data are only meaningful to store for a limited time. For that purpose we can use the TTL (time to live) field. The time to live can be either a number of times the data can be viewed (indicating that the data are dynamic and should be re-fetched when re-accessed), or a duration of time in which the data are valid. The validation of cache entries can either happen by running a thread that performs the check, or during every search that is performed in the cache. A dedicated thread cleans the data as often as it is run. The disadvantage of this solution is that it consumes processing power. We can, of course, schedule the cleaning process, but then we risk that data become outdated in the time between the scheduled runs, and hence need to be checked during connection. The other solution cleans the cache while servicing every request. When searching the cache for an entry that matches the request key, we can just as well clean all the entries that are outdated. When we find an entry in the cache, we stop the search, and all the outdated data which have not been checked remain in the database. On the other hand, when we search the cache for request keys that do not exist in the cache, we may check and clean the whole database (depending on database design). In any case, every database lookup needs to include a check of the TTL, and in this way the entries get validated. In periods with low request activity we may choose to run a cache cleaning thread.

Handling outdated requests

An application might issue a request that is communicated to the fixed network just before a disconnection. The hand held computer might be disconnected for such a long time that the client loses interest in the response.
An application may also close down for an extended period and lose interest in a response that has arrived in the client-server support layer on the hand held computer. The hand held computer might also be permanently damaged and may not be able to collect the response from the MOM communication layer on the fixed network. In all these cases we need a mechanism that will allow us to clean up outdated requests and responses. The problem is analogous to the cache problems discussed above.

We will solve this problem by giving each request (and response) a time to live (TTL). The TTL is controlled by the application. The TTL is an expiration time that is passed through the system. This means that a message can time out on the agent, in the MOM communication layer or in the proxy. This solution implies two things:

1. Each module should have a mechanism to remove requests or request messages that have expired.
2. Each module should have access to a global time.

Using the global time, each module can clean up outdated messages, and the proxy can at the same time remove its state related to the request (support for asynchronous primitives). The proxy should drop any messages it does not expect. A global time can for example be achieved by using logical clocks and synchronizing these during communication. Time servers could also be contacted periodically to synchronize clocks. In a system with disconnections we expect the TTL to be considerably larger than the maximum bias between clocks in such a system.

Finding the agent

The proxy has the responsibility to forward the request to an agent that can execute the request against the server. Potentially, the requests can use many different types of protocols and the agents can have different capabilities. As a consequence, we need to choose an agent for each individual request. We can choose to let the proxy look in the request header to discover the request protocol, or we can choose to let this information be passed as a parameter from the application support. We chose the latter solution because it implies that the proxy becomes a general proxy for all protocols, one that does not need any logic for parsing request headers. The application support constructs the requests with all the necessary parameters, because the information about the protocol is known in the application support logic.
In conclusion, the proxy just forwards requests to an agent that is looked up based on the information about the request protocol used (for example http). The protocol is passed as a parameter from the application support. The proxy needs to know the address of the agent in order to construct the MOM-message. This is a general addressing problem: we can choose to hard code addresses, the agent address can be discovered by an address lookup service, or the agent can be discovered by broadcast methods. Any of the methods can be used in our system as long as they are client initiated.
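As a small sketch of the hard-coded variant of this lookup, the proxy can keep a table from protocol name to agent address and wrap the request without ever inspecting it. The table contents, the address strings, the message field names and the function names below are all hypothetical and chosen only for illustration.

```python
from typing import Dict

# Hypothetical static table: request protocol -> MOM address of an agent.
AGENT_ADDRESSES: Dict[str, str] = {
    "http": "mom://fixed-net/agents/http",
    "corba": "mom://fixed-net/agents/corba",
}


def find_agent(protocol: str, table: Dict[str, str] = AGENT_ADDRESSES) -> str:
    """Return the agent address for the protocol named by the application support."""
    try:
        return table[protocol.lower()]
    except KeyError:
        raise LookupError(f"no agent registered for protocol {protocol!r}") from None


def build_mom_message(request_text: str, protocol: str, request_id: str) -> dict:
    # The proxy does not parse request_text; the protocol arrives as a parameter.
    return {
        "to": find_agent(protocol),
        "correlation_id": request_id,  # used later to pair the response
        "body": request_text,
    }
```

Replacing the table by a lookup service or a broadcast would only change find_agent; the rest of the proxy stays protocol independent.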

Asynchronous interface to application

It is clear that the interface to the client-server support system has to be extended with an asynchronous interface. The application support defines the client-server system's interface to the application. We have already discussed the synchronous interface. In this section, we discuss how to extend this interface to support asynchronous invocations. We can choose between offering polling or call-back primitives. Because we want to allow an application to close down for a period of time after issuing a request, a call-back solution is not adequate. Consequently, we chose to use polling. We do this by adding a test-primitive to the existing primitives. This test-primitive returns true immediately if the response to a request has returned to the proxy. By offering this simple primitive, control and responsiveness are kept in the application, and we believe this is the best solution. However, the interface can easily be extended with a send or test primitive that blocks for a user-defined period of time.

One-way requests

Some requests may not need a response. The application might just want to submit some data to the server in a request without any return parameters. We are building a client-server system with delivery guarantee. In principle we do not need a response to be sure that the request was delivered if we assume that the server is always ready. However, because we do not include the server in our reliability model (see section 7.1) we can not be sure that the server is running. In this case we can use an (empty) server response as an acknowledgement. If the application is interested in the response as an acknowledgement it can fetch it from the proxy. Otherwise it can let the response time out.

MOM primitives used

The proxy communicates with the agent through the MOM system. A MOM system is generally passive.
Both agent and proxy have to contact the MOM and ask to receive messages or to send messages. Sending messages does not present any problem, but receiving messages does, because it can not be known when the messages arrive. By using a call-back primitive, the MOM can be made to contact the receiver when the message arrives. Generally, it is also possible to block on a receive call to the MOM system. We have chosen to use polling when the proxy receives messages from the MOM system. We have done this mainly because implementing call-back

in MOM takes some more effort than polling. In addition, we have chosen polling from the application to the proxy. By choosing polling from the proxy to the MOM-system, we open the possibility to give the application full control over when response messages are processed.

Summary - Proxy

The proxy is kept as simple as possible and contains no application logic. We address entries in the cache by a hash of the request that led to the retrieval of the data in the entry. We have analyzed how write-back of changed objects can be supported. We do not need this feature and will not investigate it further. The application can use the following parameters to control proxy behavior like cache content, cache coherency level and request handling:

1. Cache response. This parameter determines whether or not the proxy should cache the response to the request.
2. Service from cache. This parameter determines whether the proxy is allowed to service the request from the cache or whether the proxy should be forced to forward the request to the remote server.
3. TTL. The application can set an expiration time on a request. If the TTL expires before the application fetches the response from the application support layer, all traces of the request are removed from the proxy (and from the other modules in the system, like the agent and the MOM communication layer).
4. TTL Cache. This parameter lets the application set the TTL/refresh interval of the response in the cache (Cache response should be set).
5. Cache priority. This parameter lets the application determine the priority of the response in the cache. An entry with the highest priority can not be evicted but is always refreshed with an interval defined by the TTL.
6. Object/Complex server. This parameter instructs the proxy how to interpret a PUT-request. If we use the system with an object-server, a PUT request means that a changed object is written back to the server.
In this case the proxy needs to update the object copy in the cache.
7. Request ID. This parameter identifies the application issuing the request and the particular request.
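To make the parameter set concrete, the seven parameters can be grouped in one record that the application support hands to the proxy together with the request. The sketch below is an assumption-laden illustration: the field names, the default values and the use of SHA-256 for the cache key are not taken from the prototype.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestParams:
    request_id: str                   # 7. identifies application and request
    cache_response: bool = False      # 1. store the response in the cache
    service_from_cache: bool = True   # 2. allow answering from the cache
    ttl_seconds: float = 3600.0       # 3. lifetime of the request/response
    cache_ttl_seconds: float = 600.0  # 4. TTL/refresh interval in the cache
    cache_priority: int = 0           # 5. higher values survive eviction longer
    object_server: bool = False       # 6. PUT means write-back of an object

    def __post_init__(self) -> None:
        if self.ttl_seconds <= 0 or self.cache_ttl_seconds <= 0:
            raise ValueError("TTL values must be positive")


def cache_key(request_text: str) -> str:
    """Hash of the request that led to the cached data (hash choice assumed)."""
    return hashlib.sha256(request_text.encode("utf-8")).hexdigest()
```

For example, RequestParams(request_id="browser:7", cache_response=True, cache_priority=2) would ask the proxy to cache the response with a raised priority while keeping the other defaults.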

The application is offered a test-primitive so that it can poll the proxy for the response to a request. In the first use of the primitive, the request is sent to the proxy. In subsequent uses the proxy is polled. We allow the application to close down while the client-server support system executes the request. For this reason we do not support call-back primitives in the client-server interface. If the application only calls the test-primitive once, it is equivalent to a one-way request. The application does not have to administer the request. When it times out, the client-server support system removes all data structures related to the request in the proxy. The proxy uses the Request ID in the transformation between requests and MOM-messages. The Request ID is used as the correlation ID in the MOM-message and is used to pair the response message with the request. We encode the MOM-messages in text. We do not use any compression techniques before packing the requests in the MOM-messages. When the TTL for a cache entry expires it is either deleted or refreshed. Refreshing is done by re-executing the request that led to the response in the cache entry. We detect expiration and clean the cache during the use of the cache. However, we also plan to use a special cleaning thread that ensures cache coherency during periods with low activity. The cache size is fixed. The cache priority and TTL determine which entries have to be evicted if the cache is filled up. Based on the client-server protocol used (the proxy retrieves this information from the request header) the proxy chooses an agent.

6.4 Analyzing the agent

Agents are responsible for executing client requests on the on-line side. They can vary in complexity. They can stretch from a simple repeater that adapts asynchronous requests to synchronous protocols, to advanced agents that can perform complicated requests or execute a whole application on the server side.
Earlier in the analysis (section 5.3) we have presented agents that are responsible for adapting the responses to the thin client architecture. We can assume that agents are present on different levels of abstraction: on the communication level, they can perform network specific tasks (like differencing); on the application support level, they can be used to translate asynchronous to synchronous requests; on the application level, they can act as servers contacting other servers.

In this section, we will present agents that adapt clients to the disconnected environment. We will start by analyzing a simple agent of the type we presented in the analysis of our overall approach (see section 6.2.7). Then, we will analyze more complex agents.

Analysis of a simple agent

The responsibility of a simple agent is to service the client application with the proper server responses. The agent must connect the client requests with the server replies, i.e. maintain the proper request-reply data. Every time a request from a client is fetched, the agent executes the request and returns the result to the sender. When analyzing the agent, we will address the following problems:

1. Fetching a message from the communication layer,
2. Unpacking the message,
3. Executing request to the server,
4. Generating a response message,
5. Delivering the response back to the client.

Fetching a message from the communication layer

The agent can communicate with the asynchronous communication layer (MOM layer) by polling, call-back or by blocking calls. In one of these ways it can fetch the messages that are sent to it or wait for incoming messages.

Unpacking the message

The fetched message is unpacked and its content is used to create a request that is a copy of the request issued on the client. If unpacking is not successful, the agent can inform the client about the cause of the problem (for example wrong format, missing data, ...). When the agent receives a request from MOM, it has to be able to identify the request, the target and the receiver. As already mentioned, we use an ID field (correlation ID) in MOM messages to pair requests and responses in the proxy. This means that the agent

must extract this ID from the MOM message and write the same ID in the response MOM-message. In the MOM message, the receiver is identified. The receiver is the proxy. The proxy, in turn, knows to what application it should forward the response. The target of the request will be included in the request-header in the form of a URL. The agent just has to extract the URL and issue the request. We have assumed that the agent has been chosen based on the request-protocol (for example http) that we want to use. Consequently, the agent will include the logic necessary to parse the request-header.

Executing request to the server

The requests are usually executed synchronously by the agent. The agent blocks, waiting for the server's response. During the execution, the agent can meet several types of events: the server delivers an answer, the server is inaccessible, or a network error occurs. The execution of the requests on the agent may be controlled by the client. The agent may be instructed to act in a certain way in case of errors (the last two examples). In case of a network error or server inaccessibility, the agent may return an error message to the client immediately, or wait for some time and try to re-connect later. We have chosen to let the application send a parameter to the agent that informs it of the policy. When the server can not be contacted we can return immediately with an error message, or try to contact the server periodically a number of times before returning with an error message. Since a response is expected on the client side, the agent always generates a response.

Generating a response message

The response may either be a response from the server or status information to the client. The correlation ID that was received in the request message is placed in the response message.
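The execution and response steps of the simple agent can be sketched in a few lines. This is a hedged illustration only: the message field names (reply_to, correlation_id, body, status), the handle_request function and the way the retry policy is passed are assumptions, and the actual server call is injected as a callable so the sketch stays independent of any protocol.

```python
import time
from typing import Callable, Dict


def handle_request(
    message: Dict[str, str],
    execute: Callable[[str], str],
    retries: int = 0,
    retry_delay: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Execute one request message and always return a response message."""
    response = {
        "to": message["reply_to"],                    # the proxy
        "correlation_id": message["correlation_id"],  # echoed unchanged
    }
    last_error = ""
    attempts = max(retries, 0) + 1
    for attempt in range(attempts):
        try:
            # Synchronous call: the agent blocks until the server answers.
            response.update(status="ok", body=execute(message["body"]))
            return response
        except OSError as err:  # server inaccessible or network error
            last_error = str(err)
            if attempt + 1 < attempts:
                sleep(retry_delay)  # policy: wait, then try again
    # Policy exhausted: report the status instead of a server response.
    response.update(status="error", body=last_error)
    return response
```

With retries=0 the agent returns an error message immediately; a positive value corresponds to the policy of contacting the server periodically a number of times before giving up.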

Delivering the response back to the client

Delivery happens using the asynchronous communication layer.

Analysis of a complex agent

The complexity of the agent is determined by its responsibilities. The simple agent which we have discussed is only responsible for fetching request messages and executing these synchronously. In this section we will present:

1. Agent executing complex requests (application agent)
2. Protocol transforming agent
3. Transaction managing agents

Application agent

More advanced agents may execute whole applications on behalf of the client on the hand held computer. In W4 a browser application runs as an agent for the browser viewer that is executed on the client [Bartlett 1994].

Protocol transforming agent

Agents can also be used for protocol transformation. A sophisticated agent may execute a protocol on behalf of the client against the server, transform the communication, and use another protocol to communicate with the client. This transformation may be one message to one message. Preferably the communication to the client should be less verbose than that between the agent and the server. This model can be very useful if the server protocol is a typical on-line protocol that includes extensive handshaking and acknowledgements. A hand held computer experiencing disconnections will not be able to run such a protocol. Even if the hand held computer has weak connectivity, such protocols might be too verbose. The solution might be to instruct the agent in one protocol and let the agent run the server protocol. A special kind of protocol transformation can be seen if agents act as gateways. In this project we have analyzed the possibility of making a gateway from http-requests to the CORBA-interface used in Rime (Radiometer's server) and implementing the server application logic directly in this agent. We chose not to do this for several reasons (see section 8.3) but it was clearly a viable way.

Transaction managing agents

Agents can also be used as transaction managers. In this case a request from the client might involve contacting several servers synchronously and running a distributed transaction. Only the result (committed or aborted) needs to be returned to the client.

Common to all types of agents mentioned in this section is that they compensate for the limitations of hand held computers. In particular, the mentioned techniques make it possible for clients that are periodically disconnected to work with applications that demand on-line synchronous clients.

6.5 Analyzing the server

Our strategy is to demand as little as possible from the server. There are some problems where the server is needed.

Support for mobile objects

If the server wants to support the PUT-method and allow clients to write back changed objects, it should be able to understand the semantics of the PUT-method defined earlier in this chapter. If the server wants to support objects that can synchronize themselves, it must send this kind of object when the client uses the GET-method.

Advantages of the client-agent-server model

When using the simple agent model discussed in section 6.4.1, the server is only responsible for executing the requests. This model is very useful when we want to use hand held computers with an existing client-server system. If the server is optimized for synchronous invocations and we already have a number of clients on the fixed network that use the server, it might be costly to change the server to use asynchronous invocations. When using the client-agent-server model the existing system is kept unaffected.

6.6 Analysis of MOM communication layer

The MOM communication layer abstracts all network communication between the client and the agent. In this context we will discuss two aspects:

1. Method for exchanging messages between MOM-queues.
2. Handling disconnections.

Methods for exchanging messages between MOM-queues

Messages between the MOM-layer on the hand held computer and the fixed network can be exchanged by many methods. Many hand held computers have a standard way to synchronize information. An example of this is HotSync on the Palm platform. The synchronization method is a batch type synchronization that is useful when the hand held computer is placed in a cradle with direct or indirect connection to the network. The synchronization methods work over wireless links too. A special system up-call can start applications if application logic is necessary for the synchronization process. We have chosen not to explore this method further. Our goal is to have a synchronization procedure that works continuously with different levels of connectivity. If the hand held computer is connected, we want any request issued by an application to be transported immediately to the server. In addition, we do not want a synchronization procedure that involves the user (HotSync normally requires the user to press a HotSync button). The synchronization should run in the background. As a consequence, we have chosen to establish direct network connections between the MOM-layers and use standard network protocols to exchange messages in the background.

Handling network disconnection

As we shall discuss later, the application should have some control over the status of the network connection. However, to a large extent the communication layer should control when to connect to the fixed network and when to disconnect. If there are messages in the out-box, it should try to exchange these with the MOM-layer on the fixed network. In addition, it should periodically poll the MOM-layer on the fixed network for messages.
If proper hardware support was present on the hand held computers, the connection on the low level could in principle be initiated by the fixed network part. A system up-call starting the MOM-layer on the hand held computer could be used to start the exchange of messages. A simpler solution would be to let the MOM-layer on the hand held computer initiate the communication. Generally, we think that the network connection between the hand held computer and the fixed network should be initiated by the hand held computer. The MOM-layer on the hand held computer can decide to keep the connection open as long as it wants. Any policy can be used here. The type of connection depends on the lower protocols used. The connection can also be a virtual connection. If the MOM-layer on the hand held computer loses the connection, it can try to reestablish the connection. One strategy is to try to reconnect periodically. Another strategy is to use an algorithm that tries to connect with larger and larger intervals.

6.7 Analysis of browser application

As mentioned before, we have chosen to build an off-line browser application on top of our client-server system in order to support thin client applications. In this section, we will discuss the following problems:

1. Awareness of disconnection.
2. Browser sessions.
3. Application level cache.

Awareness of disconnection

By choosing a proxy solution it seems plausible that the application does not have to be aware of disconnections. If the proxy can service requests locally, the disconnections can be transparent to the application. Because we can not replicate the complete server, and because we have only replicated a general proxy with access to a small cache, we can not avoid cache misses. If we are disconnected, simple read (GET) requests might have to be queued for execution on the remote server. If the requests result in complex server actions, we also have to queue the request. As a consequence, the application has to be aware of disconnections. We could use a standard browser and open a new window for each request [Brown 1995]. If we are disconnected for longer periods, this solution is clearly not good. Instead, we have chosen to build a browser that offers a special page displaying pending requests. This solution is similar to the solution used in WebExpress for a browser with weak connectivity [Housel 1996]. The information necessary to build the page is provided through a special API to the proxy. Sometimes, the user has total control over the synchronization time. This is the case if the hand held computer is only equipped with a serial cable. In this case, the application (and user) need to know if there are any messages that need to be synchronized to the fixed network. In addition we might want to give the application/user control over when to synchronize.
Because of these issues, we have chosen to give the application access to this information through special APIs to the MOM communication layer.

Browser sessions

Using form applications we may need to keep the session alive. For instance, shopping on the internet requires that the user performs requests within some time limits. This task can be problematic while disconnected. A solution for handling browser sessions is to engage the agent and let it perform the necessary requests. The agent functionality was presented earlier in this chapter.

Application level cache

Running a browser application on computers with weak processors is demanding. Especially parsing and generating the form view proved to be very time consuming. Every time a form is to be shown, it must be parsed. A way to optimize the browser performance is to keep the previously parsed pages in an application cache (a history). When a page is re-requested, instead of servicing it from the application support we can use the parsed form. This feature consumes a lot of memory, but it improves browsing considerably. We have chosen to keep the history of the previously visited pages, so the user can go back without waiting for the parsing. Additionally, we have added the functionality of going forward in the history. We believe that these two additions make the browser even more usable.

Control of parameters in system

In our system we have given the application control over a number of parameters. These parameters are sent to the proxy and determine whether or not the result of a request should be cached, whether or not a cache entry should be refreshed periodically, how long an entry in the cache should live and how long a request should live (see section 6.3). In addition, the application can send a parameter to the agent indicating whether the agent should return immediately if it can not contact the server or whether it should retry a number of times (see section 6.4.1). The system parameters give the application detailed control over the system behavior in general and over the execution of each request.
Notice that we have not determined where the application gets the information that it uses to set the parameters. We see three different ways the application can get this information to control system behavior:

1. Client controlled.

There can be some general local policies on the hand held computer or some special requirements for the particular application that should govern system behavior. Using local policies we can decide how long an application can wait for the server response, or how often it should refresh the cache entries.

2. Server controlled. The server might have the information that is needed to control system behavior. For example, it is likely that the server knows how long a particular object should be cached on the hand held computer. A strategy would be to let the server send parameters to the application, which in turn can send the parameters to the proxy. Often this method is favorable, because the server that generates the responses (forms) knows what the different requests mean and what actions should be taken on them.

3. Server hints. The last method uses information from the server to run more accurate policies on the client. This method is a combination of the two previous ones.

We believe that the optimal solution is to combine all the mentioned methods. The server may know the application logic, but the application logic may vary from one user to another. For instance, the user may not have enough RAM to cache everything that the server suggests. On the other hand, the user may not know the application steps, for instance that one request usually takes more time and that an agent should wait for the correct answer from the server. The server may run analysis processes and come up with optimal suggestions for the client concerning, for instance, caching policies. Using different weights as hints opens a possibility for making efficient local policies.

6.8 Summary - Client server model with disconnections

In this chapter we analyzed how the application's time constraints determine what kind of support for disconnection is needed.
If an application's actions can be completed in one request, this request can simply use an asynchronous request model and be queued during disconnections.

If a series of requests depend on each other, disconnections cause problems. In this case server data and logic might have to be replicated to the hand held computer.

We analyzed different approaches for handling disconnections found in the literature:

1. Changing protocol. Changing protocols at a low level can handle very short disconnections. To support longer disconnections we have to use the next method.

2. Asynchronous client-server systems. We showed that asynchronous client-server systems can be used to make the application's actions independent of the communication. In a system with short and frequent disconnections (intermittent connectivity) this model is very useful. We proposed using MOM as the asynchronous model. An agent can be used to transform from the asynchronous model to a synchronous model on the fixed network. In this way servers that accept synchronous invocations do not have to be changed.

3. Using a proxy. This helps when the disconnections are very long. Many systems in the literature support disconnections by replicating server logic and data to the client. Replication can either be in the form of objects or as light-weight servers. This model supports extended periods of disconnection but also introduces serious problems. If multiple clients are allowed to change data, this can lead to update conflicts. The reconciliation of such conflicts often needs application level logic.

We finally proposed a model that was structured as an application, a proxy, an agent and a server. The agent and the proxy used a MOM communication layer to exchange requests and responses. We showed how this system could work when the application was a browser. In the subsequent sections we analyzed each component in the model:

1. Proxy. We showed how a simple general proxy without any application logic could be built. We use simple schemes to control cache content, cache coherency and request handling.
A time-to-live (TTL) parameter is used for both requests and cache entries.

2. Agent. We showed how a simple agent that transforms from the asynchronous model to the synchronous model could be built. A parameter sent from the application is used to control the agent's request handling in case it can not contact the server. In one mode the agent returns immediately. In another mode the agent retries a number of times.

3. Server. Our strategy is to demand as little as possible from the server.

4. MOM communication layer. We showed that many different methods can be used to exchange messages between the MOM queues on the hand held computer and the fixed network (for example HotSync and Airdics). We chose to establish a network connection and exchange messages in the background. The MOM layer can handle disconnections transparently to the user.

5. Browser application. We showed that the browser application needs to be aware of disconnections. We use a special page to display pending requests. The browser issues a number of requests and can use this page to fetch the responses later, when they are ready.
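The two agent modes summarized in the list above can be sketched as follows. This is an illustration under stated assumptions, not the prototype's code: agent_forward, ServerUnavailable and the retries parameter are hypothetical names, call_server stands in for the synchronous invocation of the server, and returning None stands in for the error message that is passed back to the application.

```python
from typing import Callable, Optional


class ServerUnavailable(Exception):
    """Hypothetical error raised when the server can not be contacted."""


def agent_forward(call_server: Callable[[bytes], bytes],
                  request: bytes,
                  retries: int = 0) -> Optional[bytes]:
    """Forward one queued request synchronously to the server.

    retries == 0: return immediately if the server can not be contacted.
    retries > 0:  try again up to `retries` additional times.
    Returns the response, or None to signal an error to the application.
    """
    attempts = 1 + max(retries, 0)
    for _ in range(attempts):
        try:
            return call_server(request)
        except ServerUnavailable:
            continue  # a real agent would probably wait before the next attempt
    return None
```

The delay between attempts is deliberately left out, since the text does not specify one.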

Chapter 7

Reliable client server systems

In this chapter, we will focus on how to ensure the reliability of the system. We will assume the system structure suggested in section 6. The general types of errors and methods for reducing them are presented in appendix B. To reduce the number of errors we divide the system into modules that are recoverable and that contain errors within themselves. To sustain crashes, the modules save their state in persistent memory. The modules are watched by a watchdog, which is responsible for starting the modules. To ensure reliable data transportation, we use the store-and-forward communication method. We structure this analysis in the following sections:

1. Fault model for the client-server system. First, we define a fault model for the client-server system. Together with the fault model, we present the level of fault tolerance that we want to support. This is necessary in order to go into detail with the aspects related to reliability. We will show that we support recovery from errors in the application support layer and the communication layer. The application layer will use the application support layer's functionality to recover the unserviced requests.

2. Handling errors in modules. Then, we will discuss how to make the modules reliable, so that they can sustain soft errors. This section will include the analysis of what data should be saved in stable storage in order to recover after a crash. We will analyze every layer in our model, i.e. the application, application support and communication layers. We will show that the application can re-establish the unserviced requests by querying the application support layer.

3. Handling errors between modules. Because data are transported between modules, we have to ensure that they do not disappear. In this section we will address the issues connected with transporting data between modules. We will analyze different methods of transporting data reliably.

4. Recovering from errors. In this section, we will analyze how to handle the breakdown of a module. The analysis will result in introducing watchdogs.

7.1 Fault model for the client server system

Errors can occur as hardware or software faults. The errors can appear either on the hand held computer or on the fixed network. In addition, faults can occur in the communication link [1] between a client and a server and between the individual layers.

Level of reliability

In this section, we will present a fault model for our system that can handle transient and recoverable errors [2]. Most types of faults are transient and recoverable [Gray 1991]. In our system, we want to offer a delivery guarantee for requests and replies. As presented in figure 7.1, the reliability analysis addresses only the communication and application support layers [3]. We can not make a general support for reliability on the application level. According to the end-to-end argument [Saltzer 1994], full support for detecting and handling errors can not be made without knowledge of the application. Our focus in this thesis is to offer a client server support system, and it is not possible for us to know what kind of applications will use the support layer. The application must use its own techniques to ensure reliability. However, we will only offer recovery of invocations that have been delivered to the support system for applications. If an application breaks down before receiving the responses, it will be able to access the pending requests.
[1] The communication link will often be a combination of hardware and software.
[2] In appendix B.1, we have presented general types of errors.
[3] Later, in section 7.1.2, we will present the argument for splitting the system into layers.

Figure 7.1: Reliable client server model. The underlying modules use persistent memory to store the recovery information. Applications are not included in the reliability model.

Containment strategy

One of the techniques for reducing the number of errors is the containment technique [4], which divides software into modules that prevent errors from spreading. Based on the analysis in the preceding chapters, it is a logical step to divide client server software systems into at least three main containment modules: a communication layer, an application support layer and an application layer. Using this overall design we can contain many faults that occur in the link layer inside the communication layer. Software and hardware faults on either the client or the server side can be contained inside the layer that can handle the error most appropriately. Internal errors should be handled by the layer itself, by either masking or exception throwing. Notice that splitting the design into modules also makes the programming and testing tasks easier and therefore less prone to software errors.

[4] See appendix B.2 for different error handling techniques.

Strategy for guaranteed delivery of request/responses

In our opinion there are two basic strategies that can be used in a client-server system to ensure guaranteed delivery of requests to the server and responses to the client:

1. Client controlled delivery.

2. Point-to-point delivery (store-and-forward).

In client controlled delivery, the client executes invocations until they succeed. In case of a fault somewhere between the client and the server, it is up to the client to detect the fault and execute the invocation again.

In point-to-point delivery, the invocation is passed as a token between safe points. By safe points we mean modules that can recover in case of faults (for example a breakdown). The delivery of the invocation between the safe points becomes critical. By using a transactional delivery protocol a high level of reliability can be reached. In point-to-point delivery a reliable communication channel is in effect created from the client to the server.

Because we expect the hand held computer to be disconnected from time to time, client controlled delivery is not a good solution. If the invocation was lost due to faults in modules on the fixed network, this knowledge would only be available to the client when it reconnects. Where possible, modules on the fixed network could detect the fault and communicate this back to the client. Some faults might result in the loss of invocations that goes undetected by other modules. In this case the client must use timeouts to detect faults.

In contrast to the client controlled delivery model, point-to-point delivery is autonomous by design. In case of faults, the modules can recover and proceed with the invocation. Only faults that have application level importance need to be transmitted back to the client. Point-to-point delivery also fits the containment technique well. Based on these arguments we believe that a point-to-point delivery model is the best design when disconnected operation is involved.

The server and the application are not part of our fault model.
In general, we can not assume that we have control over the server (it could be a web server), so we can not include the server in our fault tolerance model. In our client server support system, we will assume that the server is running. If we can not contact the server, we will return an error message to the application and let the application decide what to do.

Strategy for guaranteed delivery of request/responses

Our strategy for offering guaranteed delivery of requests and responses is thus to divide the system into modules that contain faults. Each module has the responsibility to recover itself. In addition, a transactional protocol is used between modules in order to ensure that messages are not lost.

Communication layer

The communication layer is designed to mask link faults between the hand held computer and the fixed network [5]. Link faults are contained inside this layer. In order to extend masking to include faults in the communication modules on the hand held computer and on the fixed network, we have to add more functionality. We want to guarantee reliable delivery of messages between the communication modules. This can be done by extending the communication layer with persistent storage of messages, a transactional protocol and a recovery procedure that uses data saved in persistent state.

Application support layer

The application support layer should offer a delivery guarantee for requests to servers and responses to clients. To support this service, the application can make use of the message delivery guarantee offered by the communication layer. In addition, we need a transactional delivery protocol between the application support module and the communication layer modules. We also need to add a recovery procedure to the application support layer and let it save state to persistent storage.

Application layer

The application layer uses the client server system composed of the application support layer and the communication layer. In order to recover from faults, the client part of the application needs to recover requests that have been issued to the system layer. This problem can be solved by letting the application store the necessary state or by letting the application support layer offer support for this. We choose to let the application support layer offer a list of unserviced requests.

Summary - Fault model for the client-server system

The level of fault tolerance we want to support is a delivery guarantee for requests and replies in case of transient, recoverable faults. The fault model that we use only covers the client-server support system. The application has to ensure its own reliability.
However, our system offers an interface so that the application can recover pending requests. We use a containment strategy with three containment modules: the communication layer, the application support layer and the application. Our fault model is based on a point-to-point delivery strategy (store-and-forward). We divide our system into modules that can recover themselves and use a transactional protocol to transport messages between them.

[5] Actually between the MOM module on the hand held computer and the MOM module on the fixed network machine.

7.2 Handling errors in modules

Our overall strategy for recovery is that each module has the responsibility to recover itself. Because an error can occur at any time, we have to analyze how a module can re-establish its state after an error. Therefore, in this section we will analyze the issues connected with re-establishing a module's state after a fault. We will present what data each layer should save, and what responsibilities it should have. We will analyze where the recovery data should be stored. The analysis will include the following aspects:

1. Saving state necessary for recovery. What data should be stored to be able to recover every module and guarantee the delivery of requests and replies to the application.

2. Location of persistent state. In this section we will analyze where the persistent data is to be kept. We will argue for using local storage instead of storing the module's state on the network.

Saving state necessary for recovery

Every module has different responsibilities and keeps different data. In order to return to a stable operational state, the modules need to save the necessary data in some persistent memory [6]. In the following, we present the data that need to be saved in order to reconstruct each module's state:

communication layer: The necessary state for recovery of a module in the communication layer is in principle reduced to the messages (remember that there are a client module and a fixed network module). The messages encode all the necessary information (for example source and destination address). Depending on the implementation of the communication layer, there might be some additional state that should be saved. In addition, there might be some state belonging to the transactional protocols that the communication layer modules use to communicate with peer modules.
[6] The memory can be flash RAM or some storage on the network.

application support layer: The necessary state for recovery of the application support module on the hand held computer is:

- Information that links request messages with response messages. Any necessary policy information should be included in this.
- The request, until it has been delivered to the communication layer.
- The response, until it has been delivered to the application layer.
- Information necessary to recreate the application's link to a request. Remember that the application support layer should support the application with this information.

In addition, the cache can be kept on stable storage for convenience.

The necessary state for recovery of the application support module on the fixed network (the agent) is:

- Information that links requests to servers with responses from servers. Any necessary policy information should be included in this.
- The request, until it has been delivered to the server.
- The response, until it has been delivered to the communication layer.

In addition, there might be some state belonging to the transactional protocols that the application support layer module uses to communicate with the peer module.

application layer: This layer is not included in the reliability model. Because we offer a delivery guarantee for requests and replies, we must deliver all the responses that have not been serviced even though the application crashed. The requests that have been issued and not serviced are the only information that the application needs for recovery. As already mentioned, this information is provided by the application support layer on the hand held computer. By using this information to recreate links to the requests that were pending at the time of the fault, the application can fully recover pending requests. Notice that responses that had already been delivered to the application at the time of the fault are the application's own responsibility.

Location of persistent state

As mentioned, each module is responsible for its own recovery state.
Each module can save its state on persistent storage connected to the node it is executing on, or on a remote node.

For high levels of reliability (and availability) the modules might want to save state in replicas on different nodes. In general, the persistent storage on hand held computers is not as reliable as on fixed computers. Battery loss on a Palm computer will result in loss of the data and can be considered a hard fault. Loss of battery is in fact likely to happen, and it can be argued that all state necessary for recovery of hand held computers should be checkpointed to persistent storage on the fixed network [Pradhan 1996] [7]. We can also choose to checkpoint to a fixed server on the fixed network. A more complex solution would be to checkpoint to the access points (AP/base stations). If we roam from one AP to another we can choose to use a complex recovery algorithm that collects checkpoints at APs, or we can choose to use a complicated handoff algorithm.

In our system, we can expect to have periods of disconnection. In these periods, we can not checkpoint state to the fixed network. When we are connected, we can transfer requests to the fixed network, where they can be considered safer. Each time we transfer a request to the fixed network, we could also transfer any state on the hand held computer (from the application, application support and communication layers) necessary to recover the logic connected to the request. The most elegant way to do this is to pack the information with the request into one message. The agent could hold the information until the server returns the response. Two options are then open:

- The agent could keep the state and offer an API to the application support layer on the hand held computers, so that they can recover.
- The agent could pack the information in the response message.

In the first case, the complexity of the agent increases. The agent has to keep both the state and the response message until it is sure that the client does not need them anymore.
In addition, some special communication between the application support layer and the agent must be possible in order to retrieve the state information.

In the second case recovery could be done on the fly. After a breakdown the application support layer can inspect messages in the communication layer. If it does not know the invocation corresponding to a message, it can assume that the message comes from an invocation that was lost because of a hard fault. By using the recovery state packed in the message, it could recover that particular invocation. (Notice the similarity with support for push.)

[7] Many hand held computers use different power schemes allowing them to survive longer with low batteries. The Palm from 3Com blocks the screen and can keep the data for three weeks without losing them.

There are special security issues in this method: we have to be sure that the

response message is valid. Because we can not assume that the application support layer can identify this particular message, some special techniques have to be used (for example the addition of a certificate). In addition to the overhead on the request messages, the method imposes some overhead on the response messages.

We generally think that saving state on persistent storage on the fixed network will impose too much overhead on the messages. In addition, the recovery procedures become too complex. We think that the persistent storage offered by hand held computers is adequately reliable for our purpose. This choice means that we can not guarantee reliable request delivery in the face of hard faults on the hand held computer.

Summary - Handling errors in modules

We have chosen the persistent storage offered by hand held computers to store the recovery state. The layers need to save the following data in order to re-establish their state:

communication layer: The messages keep all the necessary information. The only data that should be saved are the messages.

application support layer: Two kinds of information should be saved, one for the part running on the hand held computer and one for the agent. For the hand held computer the data to be saved for recovery are:

- Information that links request messages with response messages (e.g. a correlation number).
- The request, until it has been delivered to the communication layer.
- The response, until it has been delivered to the application layer.
- Information necessary to recreate the application's link to a request.

For the agent the data include:

- Information that links requests to servers with responses from servers (e.g. a correlation number).
- The request, until it has been delivered to the server.
- The response, until it has been delivered to the communication layer.

application layer: The application is left to take care of its own state. Once a response has been delivered to the application, the application is responsible for it, so it may need to save the request state itself.

The modules have the following responsibilities:

communication layer: Guarantees reliable delivery of messages between the communication modules; hence it uses reliable stable storage to save the messages. This module masks network errors. It can recover itself from the saved state.

application support layer: Offers a delivery guarantee for requests to servers and responses to clients. This module offers a list of pending requests to the application. It can recover itself from stable storage.

application layer: Uses the recovery procedures of the communication layer and the application support layer. It is offered an API to reconstruct the unserviced requests. The API is offered by the application support layer.

7.3 Handling errors between modules

Transactional protocol

Sending messages between computers (and modules) is critical in our system. In this section we will discuss the functionality of a transactional protocol that must be used to ensure reliability. We assume that the messages are secured as soon as they have been transferred from one module to another. The persistent storage and recovery procedures should guarantee this. In order to guarantee reliable transfer of requests and replies in the total system, we need to guarantee delivery of messages between the modules. We will simplify this problem by assuming that we have a reliable connection-oriented communication protocol at our disposal (for example TCP/IP). This relieves us from securing data integrity and also gives us FIFO properties in the communication channel. In addition, we assume that we get a notification (an exception) if something goes wrong at the communication level.
We will simplify our problem even further by requiring that messages that encode requests and responses are transferred atomically. The whole message should either be transferred or not transferred.

We will assume that we have a simple persistent queue system on both the sender and the receiver. Let us assume that the transfer protocol between the queues consists of the following steps, in order:

1. Sender copies the message encoding the request/response from persistent storage.
2. Sender sends the message to the receiver (using a reliable protocol).
3. Receiver receives the message.
4. Receiver saves the message on persistent storage.
5. Receiver sends an acknowledgement to the sender (using a reliable protocol).
6. Sender receives the acknowledgement.
7. Sender deletes the message encoding the request/response from persistent storage.

After a breakdown of either sender or receiver, we assume that the sender simply sends the first message in the persistent queue. If the receiver breaks down between steps 4 and 5, the sender will not know that the message was transferred correctly. In this case, the message will be retransmitted after recovery, leading to duplication of the message.

We can not fix the problem by sending the acknowledgement before saving the message: if the sender receives the acknowledgement, it must assume that it can delete the message from persistent storage [8]. If the receiver breaks down after sending the acknowledgement but before saving the message, the message can be lost from the system. The persistent MOM queues on the hand held computer and the fixed network side are accessed concurrently. This would be unacceptable in our system.

The problem outlined here is part of a general problem ([Tanenbaum 1996], page 508). The receiver can either write first and acknowledge later or acknowledge first and write later. The main problem is that writing and acknowledging are two separate events. In general, the problem can only be solved by a layer above the protocol layer that holds additional status information.
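The seven steps above, together with one possible form of the additional status information, can be sketched as follows. This is a simplified illustration, not the prototype's implementation. Ordinary in-memory containers stand in for the persistent queues, a raised ConnectionError stands in for the notification from the reliable protocol, and the names Sender, Receiver, transfer_head, deliver, last_seq and crash_before_ack are assumptions introduced here. The receiver remembers the last sequence number it stored for each sender, which is only sufficient because the channel is assumed to be FIFO.

```python
from collections import deque


class Receiver:
    def __init__(self):
        self.queue = []                # stands in for the persistent receive queue
        self.last_seq = {}             # sender id -> last sequence number stored
        self.crash_before_ack = False  # test hook: simulate a crash between steps 4 and 5

    def deliver(self, sender_id, seq, payload):
        """Steps 3-5: write first, acknowledge later. Returning True is the ack."""
        if self.last_seq.get(sender_id, -1) >= seq:
            return True                # retransmission: acknowledge again, drop the copy
        # In a real system these two writes would have to be one atomic operation.
        self.queue.append(payload)
        self.last_seq[sender_id] = seq
        if self.crash_before_ack:
            self.crash_before_ack = False
            raise ConnectionError("receiver crashed before acknowledging")
        return True


class Sender:
    def __init__(self, sender_id):
        self.sender_id = sender_id
        self.queue = deque()           # stands in for the persistent send queue
        self.next_seq = 0

    def enqueue(self, payload):
        self.queue.append((self.next_seq, payload))
        self.next_seq += 1

    def transfer_head(self, receiver):
        """Steps 1-2 and 6-7: send the head of the queue, delete it only after the ack."""
        if not self.queue:
            return False
        seq, payload = self.queue[0]
        try:
            acked = receiver.deliver(self.sender_id, seq, payload)
        except ConnectionError:
            return False               # keep the message; retransmit after recovery
        if acked:
            self.queue.popleft()
        return acked
```

If the receiver crashes after saving a message but before acknowledging it, the next call to transfer_head retransmits the same message, and the receiver acknowledges it without storing a second copy.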
[8] Notice that because we use a reliable protocol we do not need to send an acknowledgement explicitly in this protocol. If the bytes were transferred correctly, we can take this as an acknowledgement.

If we retain the write-first-acknowledge-later protocol in the simple model we have sketched (two persistent queues with retransmission from the head of the sender queue), we have a guarantee of at-least-once delivery. However, we also want to avoid multiple copies of messages. We can improve this

protocol by adding a mechanism on top of it for detecting copies. For example, the receiver could hold information identifying the last received message for the particular sender (for example a running number, a hash or the message itself) and could acknowledge and drop any retransmitted messages that have already been received correctly. Notice that this solution means that the status information should be stored on persistent storage, so that we can recover from receiver crashes. In addition, the status should be stored in an atomic operation that includes saving the message to the queue [9]. Notice that in general we will need atomic writes and deletes to persistent storage. Otherwise we might corrupt data.

In the discussion above, we have focused on receiver breakdown. Notice that our protocol handles sender breakdowns too. The sender will retransmit a message until it is deleted from the sender queue. All logic and status necessary to avoid multiple copies are placed on the receiver.

Summary - Handling errors between modules

We use a store-and-forward scheme. A message sent to the receiver is acknowledged when the receiver has stored it on persistent storage. After receiving the acknowledgement the sender can delete the message from its queue. This scheme offers at-least-once delivery.

7.4 Recovering from system faults

Up to now, we have assumed that the modules can recover themselves. In fact, we have assumed that the modules have not crashed and do not hang. In general, this is not a realistic assumption. In the case of hanging or crashed modules, something outside the module is clearly needed to initiate the recovery process. In this section we analyze:

1. A model for recovering from module crash. In this analysis, we will look at different models for checking whether a module is operational. We will present the different models and introduce watchdogs as the preferable solution.

2. Techniques for checking modules.
In this analysis, we will present methods for checking whether the watched modules are alive. This functionality is necessary for a watchdog to react in case of a breakdown.

[9] Otherwise, we would have the problem of whether to save this status before or after we have saved the message in the queue.

3. How to restart modules. The process of restarting modules is not trivial. We will analyze how to handle module breakdown and how to update the state of modules that refer to the restarted module.

Maintaining module functionality

The module functionality can be maintained using different techniques:

1. We could choose to let the user restart modules manually.
2. We could let the different modules watch over each other and restart crashed processes.
3. We could let a special administrative module watch over the modules.

In our model, it would not be reasonable to let the modules watch each other. The communication layer is based on MOMs. This means that the communication layer is designed to be passive. Modules that need the services provided by the layers need to contact the layers themselves [10].

[10] Some systems let the clients register an event channel and a callback, but in general MOMs are passive.

Figure 7.2: Process pair of watchdogs. One watches the system, the other watches the other watchdog.

The communication layer does not need to know where the client layers are. It would not be a good design to let this module watch over other modules. As mentioned before, we can not generally assume that we have control over the server implementation. This means that we can not let this module watch

over the others. Finally, it would not make sense to let the proxy watch over the other modules. In our design there is only one communication layer module on the fixed network but potentially many proxies. These modules would then have to agree on which of them should play the role of watchdog.

A simpler solution is to let a special administrative module (a watchdog) watch over the other modules. This module would have a well-defined responsibility. In order to function under disconnections, there should be a watchdog on both the hand held computer and the fixed network. The watchdog should be reliable. The watchdog could be designed as a number of watchdogs (preferably placed on different nodes) watching each other and the modules. We think that a simpler design is adequate in our case (figure 7.2). The watchdog could be composed of a process pair where one process watches the modules and the other process, and the other watches the primary process.

Checking modules

In this section we present the methods by which a watchdog can check whether modules are alive. There are generally two methods [Huang 1995]:

1. Modules send a signal (a so-called heartbeat). Modules inform the watchdog about their status by sending a heartbeat (footnote 11). If a heartbeat does not arrive within the expected time, the watchdog treats the module as hung and can start the recovery procedure. This method requires that modules know where to send the signal, unless a broadcast method is used.
2. The watchdog pings the modules and waits for a response. This method is initiated by the watchdog, which sends a null message to the modules. If a response does not arrive, the watchdog waits an application-specified time and sends the request again. If the response still does not arrive, it treats the module as hung. Pinging is less efficient since more traffic is created. On the other hand, the watched modules do not need to know where to send their responses, since they only need to reply to the pings. This method is preferable when we use watchdog pairs and a watchdog can be restarted.

Footnote 11: A heartbeat can include a parameter telling when the next heartbeat will come.

The watchdog can also use one of the mentioned methods to watch itself. In order to do that it just makes a copy of itself. When it decides that the

watched watchdog is hung, it recovers it by making a new copy of itself and letting the copy watch it. We use the method of pinging the modules.

Restarting modules

When a module hangs and we must restart it, we have to remember that the hung module may have been accessed by other modules that are still running. Restarting the module without updating the modules that refer to it will not make the system operational again. There are different solutions to this problem:

1. Static addressing. All modules are predefined to be accessed in one way. The server listens on a predefined port and IP address, and the agent contacts the communication layer by accessing a predefined database. Even though the processes restart, their placement is the same. This method can be used only when static addressing is possible, for instance in a network environment where we can access a fixed set of IP addresses and port numbers. In a multiprocessor environment or in more complex systems, we can not know the placement of modules. Therefore, we need to use one of the methods mentioned below.

2. Broadcasting. When a module is restarted, information about the restart is broadcast to all the other modules. The modules receive the updated reference to the restarted module, so the functionality is maintained. Broadcasting requires that all the modules are able to receive the broadcast. A negotiation algorithm is needed to ensure a stable final state where all modules have all the necessary information.

3. Naming service. Every access between modules goes through a naming service that is maintained by the watchdog. When a module is restarted, the watchdog updates the naming service and applications access the correct module. This is a very costly method since every request is expanded with an additional request to the naming service. Additionally, the naming service can become a bottleneck.

This method can be optimized by letting a module contact other modules without contacting the naming service. Only after a failure does the module ask the naming service for an updated reference to the requested object.

In our model, we choose the naming service maintained by the watchdog. On the on-line side the watchdog only needs to restart the modules, because they are accessed by static addressing. On the hand held computer, we use the naming service maintained by the watchdog. Modules can get the addresses from the watchdog and use them for later access. In case of error they can query the watchdog again and update their naming list.

Summary - Recovering from errors

Modules can recover from soft errors by restarting. We need an external module that restarts the modules in case of breakdown. We introduce additional processes (watchdogs) that are responsible for watching and restarting modules. To check the system, we introduce a pair of watchdogs, where one watches the modules and restarts them if necessary, and the other watches the first watchdog. The modules are checked by pinging. In order to restart hung modules and keep valid references to them, we introduce a naming service that is maintained by the watchdog. On the on-line part we use static addressing and let the watchdog restart processes that hang.

7.5 Summary - Reliable client server systems

In order to make reliable client server systems, we have to handle application errors. A method that prevents errors from spreading is the containment method. We use the containment method and place the functionalities in separate modules. The general client server support system guarantees delivery of requests to the server and return of responses. In order to have a reliable application, we support the application in reestablishing its unserviced requests. After receiving the response, the application will have to recover any other state necessary for correct function.

To support recovery of the modules, we save data in the persistent storage offered by the hand held computers. For each layer we save:

communication layer: needs only to save messages.

application support layer: saves the data that are necessary to reestablish connections. The data on the client side are:
- Information that links request messages with response messages (e.g. a correlation number).
- The request, until it has been delivered to the communication layer.
- The response, until it has been delivered to the application layer.
- Information necessary to recreate the application's link to a request.
For the on-line part (the agent) the data are:
- Information that links requests to servers with responses from servers (e.g. a correlation number).
- The request, until it has been delivered to the server.
- The response, until it has been delivered to the communication layer.

application layer: can reestablish the outstanding requests from the application support layer. After accepting the response, the application takes care of its own state.

Every module (except the application) can recover its state from the mentioned data that are saved in persistent storage.

Apart from keeping the module's state in persistent storage, we also need to take care of delivering data between modules. Reliable delivery of data can be handled by using the store-and-forward method. Additionally, we need to use a transactional protocol that guarantees that messages do not disappear during transportation. When a module hangs, we use a watchdog to restart it. The watchdog checks the modules by pinging.
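The store-and-forward scheme summarized above, combined with duplicate detection on the receiver, can be illustrated with a minimal Python sketch. It is not the thesis implementation. The class names Receiver and Sender, the JSON file used as persistent storage and the ack_lost flag (a test hook that simulates a lost acknowledgement) are all assumptions made for illustration. The receiver writes the message queue and the last running number per sender in one atomic file replacement, so that the two can never disagree after a crash.

```python
import json
import os
import tempfile


class Receiver:
    """Stores each new message together with its duplicate-detection state."""

    def __init__(self, path):
        self.path = path
        self.state = {"last_seq": {}, "queue": []}
        if os.path.exists(path):
            # Recovery after a restart: reload the persisted state.
            with open(path) as f:
                self.state = json.load(f)

    def _save_atomically(self):
        # Write a temporary file and rename it over the old one, so the
        # queue and the running numbers always change together.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def receive(self, sender, seq, body):
        """Store a new message, drop a retransmission, always acknowledge."""
        if seq > self.state["last_seq"].get(sender, 0):
            self.state["queue"].append(body)
            self.state["last_seq"][sender] = seq
            self._save_atomically()
        return seq  # the acknowledgement


class Sender:
    """Keeps a message queued until its acknowledgement has arrived."""

    def __init__(self, name):
        self.name = name
        self.next_seq = 1
        self.queue = []  # in memory here; a real module would persist it

    def submit(self, body):
        self.queue.append((self.next_seq, body))
        self.next_seq += 1

    def flush(self, receiver, ack_lost=False):
        """Send the oldest message; delete it only when the ack arrives."""
        if not self.queue:
            return
        seq, body = self.queue[0]
        ack = receiver.receive(self.name, seq, body)
        if ack == seq and not ack_lost:
            self.queue.pop(0)
```

If an acknowledgement is lost, the sender simply calls flush again. The receiver recognizes the running number, stores nothing new and acknowledges once more, so at-least-once delivery on the link results in a single stored copy.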

Chapter 8
Application supporting a work-flow process in a hospital

In chapter 2, we described the background for the hospital application. In this chapter, we analyze different problems concerning the hospital work-flow application that we will build using our general client-server system. We structure the analysis in 3 sections:

1. Overall approach. In this section we choose which parts of the general system we need in order to build the application.
2. Analysis of synchronization in work-flow processes. In this section we discuss the synchronization problem that is connected to disconnections. We show how to synchronize activities in a system with disconnections. We argue that support for disconnected work requires that we can transport data outside the system, or that we can create advanced matching procedures based on semantic dependencies between the data. Additionally, we analyze the application for the hospital work-flow.
3. Integration with Radiometer's existing IT-system. In this section we discuss using an HTTP-CORBA transformation agent to integrate our system with Radiometer's existing IT-system.

8.1 Overall approach

In the previous chapters we have analyzed a general client-server system. We believe it is possible to build the hospital work-flow application as a thin client application. By using a browser application on top of our client-server system we can structure the hospital work-flow application as a number of form-applications and a central server. We believe each activity in the hospital work-flow application can be expressed in a single form. This means that each activity involves only one request, in which the form is submitted. This request can easily be queued during disconnections. The form-application objects can be cached in the proxy.

When working with the system, we imagine that the user logs in on the system (by submitting a log-in form) and downloads the necessary form-application objects while connected to the fixed network. From this point on the user should be able to use the system continuously, even if the hand held computer disconnects periodically from the fixed network or the central server breaks down for a period of time. When the hand held computer is connected, the client-server support system exchanges messages with the fixed network in the background.

We will not change downloaded objects and consequently do not need the write-back feature that we analyzed for the general client-server support system. Our server is a complex server, and we therefore choose to use the support our system offers for this kind of server.

8.2 Analysis of a work-flow process

A work-flow process is a long-running task that involves the coordinated execution of multiple activities performed by different nodes [Alonso 1995]. In this section, we clarify different aspects of work-flow in a disconnected system. We analyze the following aspects:

1. Analysis of task execution in an IT system.
2. General description of work-flow management systems.
3. Work-flow system in a hospital.

Analysis of task execution in an IT system

A task is a set of activities. Activities can be executed either in series or in parallel. When all activities have finished executing, the task is completed. If the execution of one activity depends on the result of another, the activities must be executed serially. On the other hand, when the execution of one activity does not depend on the execution of another, they can be executed concurrently and their results must be matched at the end.

If a work-flow process is executed using our IT system, the activities are executed on the computer devices in forms generated by the server. An activity is therefore expressed as a form. In disconnected systems, communication between activities can be delayed. The delays in the system may influence the execution of activities. In the following, we look at how to support disconnected systems in executing serial and parallel activities, respectively.

Serial execution of activities

Figure 8.1: Serialized activities. If activities are serialized, then activity B can be executed either 1) by transporting data from activity A using the system, or 2) by transporting the data externally.

If activities are executed in series, the data collected in one activity must be transported to the next activity. As shown in figure 8.1, we can transport data between activities in two ways:

1. By internal system communication. The data are transported from one activity to another by direct communication or through other nodes (for instance a central server).
2. By external communication. The data are transported outside the system, for example by oral communication or by physically transporting an item.

Note that some activities that do not have to be executed serially can still be executed one after the other if it makes coordination easier. Because activities can be executed on two different computers that are disconnected, we may have problems communicating the data from one activity to the other. The second activity must wait until the data arrive through the system, or the data must be transported externally. If the second activity can not wait for the data from the system, the system should support external communication of the data.

Figure 8.2: Merging activities. Activities can be merged in three ways: 1) by transporting a merge ID using the system, 2) by agreeing on an ID outside the system, 3) by using semantic dependencies to merge the data.

Parallel execution of activities

Parallel execution of activities requires a method to merge the activities at the end. As shown in figure 8.2, there are three methods of merging the activities:

1. By using a unique ID for the activities. This method requires that activities have a unique ID when they start. For instance, we can use an activity distributing system that assigns a unique task ID to every activity. Using this ID, the activities can easily be merged. If we use a disconnected system, we may not always be able to distribute activities from one central system. Instead, we can use one of the methods below.
2. By transporting a unique ID outside the system. If we want to start the activities without contacting the system, or the system is not available, we can transport the unique ID externally, for

example by agreeing orally on a unique ID which will be included in the activity data. This ID will then be used to merge the activity data.
3. By finding semantic dependencies between the activities. If we know that the data of two activities are related, we can use the dependencies to match the activities. An example could be the time stamps of two activities: if we know that they were performed at the same time, we can use the time stamps to match the activities.

Every time we allow parallel execution of activities, we have to merge them using one of the mentioned methods. The methods should guarantee that activities are merged correctly. In disconnected systems, we can allow parallel execution of activities by distributing the activities in advance, by allowing external input of the match data, or by finding semantic relations between the data of the activities. The last method is used if we can not communicate outside the system.

Interdependency between activities

Work-flow processes are usually more complex. We have not only simple relations between two activities but a system of interrelations. Therefore, activities in a work-flow are typically interdependent. For example, as shown in figure 8.3, two activities can depend on one activity, or an activity can depend on the completion of two other activities. The results of the two activities might have to be merged. In conclusion, the activities can have complex dependencies.

Work flow management systems

Work Flow Management Systems (WFMS) are general tools used to control work flows. The logic needed to control work flows can be centralized or distributed to the nodes that are involved in the work flow. There are a number of WFMS, but only a few support periodically disconnected nodes [Alonso 1995] [Klingemann 1997]. In any case, a work-flow control system usually involves:

1. Rules defining the interdependencies between the activities.
2. A process that distributes and assembles data using the rules.
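The two ingredients listed above, rules and a process that applies them, can be illustrated with a minimal Python sketch. It assumes a hypothetical four-activity work flow in which two activities depend on a first one and a fourth depends on both of them. The names RULES, ready_activities and execution_waves are illustrative only and do not come from any existing WFMS.

```python
# Rules: each activity maps to the set of activities it depends on.
RULES = {
    1: set(),
    2: {1},
    3: {1},
    4: {2, 3},
}


def ready_activities(rules, completed):
    """Return the activities whose dependencies have all completed."""
    return sorted(
        activity
        for activity, needs in rules.items()
        if activity not in completed and needs <= completed
    )


def execution_waves(rules):
    """Group activities into waves; one wave may execute in parallel."""
    completed, waves = set(), []
    while len(completed) < len(rules):
        wave = ready_activities(rules, completed)
        if not wave:
            raise ValueError("cyclic dependency between activities")
        waves.append(wave)
        completed.update(wave)
    return waves
```

With these rules, ready_activities(RULES, {1}) yields activities 2 and 3, which may then run in parallel and must be merged before activity 4 starts.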

Figure 8.3: Dependencies between activities. Activities 2 and 3 depend on Activity 1. The results from these two activities have to be merged. Activity 4 depends on the merged result.

Rules defining the interdependency between the activities

In figure 8.4, we present the dependencies between the activities in our work-flow process. We distinguish the following activities:

1) Create requisition
2) Take a blood sample
3) Create sample data
4) Measure the data
5) View the measured data

The real blood taking process in a hospital only includes activities 2), 4) and 5). For administrative purposes, the task was expanded with 1) and 3). A blood taking process also includes an additional activity (the acceptance of results), which is not included in the figure. In the following, we present the rules that can be used to match the activities in the work-flow.

Matching activities 1) and 3). Activity 3) needs the requisition ID from activity 1) to be matched correctly. If we can not transport the requisition ID from activity 1) to

3) then we can either transport a unique requisition ID externally (by agreeing orally on a given requisition ID), or we can use some matching rules. For example, the patient ID must be identical in activities 1) and 3), and the creation times of activities 1) and 3) must not be too far apart. In the system we would prefer to support both options in order to offer maximum flexibility.

Figure 8.4: Dependencies between activities. The process is started by activity 1), in which the doctor creates the requisition. Then data are transported to a nurse, so that activities 2), taking the blood sample, and 3), creating the sample data, can be performed. Then activity 4) is performed: the apparatus measures the data. Finally activity 5), in which the doctor views the data, is executed. The whole work-flow process includes activities 1) to 5).

Matching activities 1) and 4). Activity 4) needs the measurement parameters from activity 1). If we can not deliver the measurement parameters, we can either:

1. transport the data externally (as a label on the blood container), or
2. measure all the data.

If the data are not delivered, we measure all data. (When data are measured in the Radiometer apparatus, we usually measure all of the data and then filter out the unneeded results.)

Matching activity 2) with 4) is performed in the measurement apparatus. The results of 4) are labelled with the sample ID from the syringe, which is included in activity 2).

Matching activity 3) with 4), and activity 3) with 5), is performed using the unique sample ID.

If we can match activity 1) with 3) using the requisition ID, then we can easily match activity 1) with 4) and 5), because activity 3) is matched with 4) and 5) using the sample ID. On the other hand, if we can not match activity 1) with 3) using the requisition ID, and we can not transport data externally between activity 1) and activity 3), we must create a merging procedure. The merging procedure can for instance use the patient ID and the time to match activities 1) and 3). We use the merging procedure to support disconnections.

A process that distributes and assembles data using the rules

The application that we design should be able to control the execution of activities that depend on each other. The application should also merge the activities. As shown in figure 8.5, the activities are distributed using a central WFMS. Because we want to allow activity 3) to start even though activity 1) has not sent its data (because either connection a. or b. was not possible), we can create an external link 1b or create a merging procedure for activities 1) and 3). Because we can start activity 4) right after receiving g, we do not need to wait for connection d; we can apply the changes in the WFMS after it has received the data from e. We can block execution of activity 5) until connection e. is completed. We can see that activities 2), 4) and 5) can be executed without contacting the WFMS. Communication from 4) to 5) can be done by printing a report of the measured data. The data that are collected and communicated to the WFMS can be transported at a later time and merged using the mentioned rules. The WFMS uses the rules presented above, and it still allows parallel execution of activities.

It should be noted that the merge procedure needed when we merge data is the same whether we merge based on complex semantic relations or just a common unique ID. The merge procedure simply applies some match rules, and the complexity is expressed in the merge rules.
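The observation that one merge procedure can serve both kinds of rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis implementation: the record fields (requisition_id, patient_id, created), the 30 minute threshold MAX_GAP and the function names are all hypothetical. The procedure tries each match rule in turn and accepts a rule only when it singles out exactly one requisition.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=30)  # assumed threshold, not taken from the source


def same_requisition_id(req, sample):
    """Simple rule: the sample carries the requisition's unique ID."""
    rid = sample.get("requisition_id")
    return rid is not None and rid == req.get("requisition_id")


def same_patient_close_in_time(req, sample):
    """Semantic rule: same patient, creation times not too far apart."""
    return (req["patient_id"] == sample["patient_id"]
            and abs(sample["created"] - req["created"]) <= MAX_GAP)


MATCH_RULES = [same_requisition_id, same_patient_close_in_time]


def merge(requisitions, samples, rules=MATCH_RULES):
    """Pair samples with requisitions; return (pairs, unmatched samples)."""
    pairs, unmatched = [], []
    for sample in samples:
        match = None
        for rule in rules:
            candidates = [r for r in requisitions if rule(r, sample)]
            if len(candidates) == 1:  # accept only an unambiguous match
                match = candidates[0]
                break
        if match is None:
            unmatched.append(sample)
        else:
            pairs.append((match, sample))
    return pairs, unmatched
```

The merge function itself never changes. Supporting a new way of matching means adding a rule to MATCH_RULES, which is where the complexity ends up. Samples that no rule can place are returned separately so that they can be resolved by other means, for example manually.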

Figure 8.5: Transporting activities in the system. The WFMS distributes the activities. Digits denote activities, and letters denote communications of data. Results of activity 1) are transported to the WFMS (a). Then, if we want to execute activity 3), we can either wait for data from the WFMS (b) or start it and match it with 1) in the WFMS. To start activity 4), we need data from activity 2) and additional data from activity 1) that are transported using the WFMS (d). Finally, the WFMS can send the results to activity 5) using f, or we can use a printed report to view the data (h).

8.3 Integration with Radiometer's existing IT-system

On the client side, the work-flow application is constructed from a number of forms. The work-flow server logic controls the work flow by generating the forms in response to the client requests (the requests are internet type client-server requests such as HTTP requests). These HTTP requests could quite easily be transformed by an agent into CORBA calls to RIME. RIME already collects measurement results and contains a number of data structures necessary to build the application logic (footnote 1). Placing the logic for the merging procedure and task distribution in RIME would be a logical step.

Footnote 1: RIME already contains patient data structures, requisition data structures (Testorders) and personnel data structures.

Then, the agent's responsibilities would be:

1. Adaptation to different types of devices.

2. Transformation of HTTP requests to CORBA requests.
3. Generation of forms based on the CORBA response and the type of device.
4. Sending back the HTTP response to the HTTP agent (the agent transforming from asynchronous to synchronous requests).

However, from an implementation point of view, placing the logic for a prototype in RIME is not a good idea. RIME is a very complex server. The data needed from RIME to demonstrate the principles are relatively few. In addition, RIME has a relatively simple interface that makes it easy to read and write data to it. Therefore, we have chosen to implement our prototype in a separate server that synchronizes with RIME. When going from a prototype to a commercial product, the merging logic and task distribution logic should be placed in RIME (for example to avoid synchronization). In our design, we clearly separate device adaptation logic, form-generation logic and HTTP-protocol specific logic from the merge logic and task distribution logic (business logic). In this way, it becomes a relatively simple task to move the logic to RIME.

8.4 Conclusion on application

Work-flow processes consist of activities. Using disconnected work-flow systems, we may be forced to execute the activities without being able to communicate within the system. If we need data from one activity before we can execute another activity, we must have a method for transporting data outside the system. If we allow parallel execution of activities, we must merge the executed activities at the end. The activities can be merged by using a centralized activity generator which labels activities with a unique ID. In disconnected systems, we may not be able to contact the activity generator and hence not be able to execute the activities. Instead, we can use a semantic data merging procedure, or an ID which is communicated outside the system. In that case we are able to execute activities without contacting the activity generator.

Disconnected systems that allow instant execution of activities need data sources outside the system that can be used to merge the activities, or there must be a semantic relation between the data. The semantic relation is used to build a merging function. In the system we support for the hospital, we need to use all the methods to make the system flexible.

Chapter 9
Analysis Conclusion

We want to support a simple reliable client-server model where the client is executed on a hand held computer that is periodically disconnected. We analyzed general client-server systems. Subsequently we analyzed different problems that are specific to the system we want to build. In the following we present the implications of the discussed problems:

1. Choice of client-server model.
2. Implications when using hand held computers.
3. Implications when the client is periodically disconnected.
4. Implications of reliability on a client-server system.
5. A hospital application using a disconnected system.

We have taken into consideration that a system that handles the three main problems (limited resources on hand held computers, disconnections and reliability) will require trade-offs between the optimal solutions for each.

9.1 Choice of client-server model

Depending on application support, the client-server model can be used at different levels of abstraction: from the most abstract, where communication is hidden from users (CORBA, RMI), to the least abstract, where a programmer must implement all communication aspects. We have used an internet type client-server model exemplified by the HTTP protocol. The internet type model uses parameterized invocations and is less complex than CORBA and RMI.

9.2 Implications when using hand held computers

Hand held computers tend to be relatively weak with respect to processor speed and memory. This is an argument for off-loading as much computing as possible to the server. The thin client model is a solution that meets this requirement. In its pure form only key-strokes and mouse movements are sent from the client to the server, and only screen updates are sent back to the client. The application executes completely on the server. This solution is very sensitive to network latency and disconnections (if the complete server is not replicated on the client).

Instead of using a pure thin client, we use a browser type thin client, which is characterized by having a bigger part of the application executing on the client computer. Consequently, it should be able to function more autonomously. It still has the advantage of the pure thin client that only a small footprint of the application (here the browser) is installed. We have chosen to let the server adapt to the specific hand held computer platform that is used.

In order to handle the general network limitations of hand held computers we use a general design with a proxy on the hand held computer and an agent on the fixed network. Many different techniques can be used with this design, but we have chosen only two: multiplexing requests over a single connection, and caching. Exactly how we use caching depends on the solution for disconnections.

9.3 Implications when the client is periodically disconnected

The solution to the problem of periodic disconnections depends on the length of the disconnections in comparison to the application's time needs. If disconnections are very short compared to the application's demands, changing low level protocols might be enough to fix the problem. If disconnections are longer, programming with a client-server model with asynchronous primitives might be a solution. If the application's actions can each be executed in one request, this type of disconnection support is adequate even with extended periods of disconnection.

If we can not avoid extended periods of disconnection, the application logic will probably be affected. Some server functionality (and data) may be necessary on the client in order for it to function autonomously. This server logic and data can be present as objects or light-weight server replicas.
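How a proxy can hide disconnections from an application whose actions each fit in one request can be sketched in Python as follows. This is only an illustration and not the prototype described in this thesis. The class name Proxy, the send callable that stands in for the link to the fixed network, and the tuple layout of the requests are assumptions. Form objects are served from a cache, and submitted forms wait in a queue until the link is available.

```python
from collections import deque


class Proxy:
    """Client-side proxy with a form cache and a queue of pending requests."""

    def __init__(self, send):
        self.send = send        # callable that delivers one request (assumed)
        self.connected = False
        self.cache = {}         # form name -> downloaded form object
        self.pending = deque()  # requests waiting for a connection

    def get_form(self, name):
        """Serve a form from the cache; fetch it only while connected."""
        if name not in self.cache:
            if not self.connected:
                raise LookupError(f"form {name!r} not cached and no connection")
            self.cache[name] = self.send(("GET", name))
        return self.cache[name]

    def submit(self, name, fields):
        """Accept a filled-in form at once; deliver it now or later."""
        self.pending.append(("POST", name, fields))
        return self.drain()

    def drain(self):
        """Forward queued requests in order while the link is up."""
        responses = []
        while self.connected and self.pending:
            responses.append(self.send(self.pending[0]))
            self.pending.popleft()  # removed only after send has returned
        return responses

    def set_connected(self, connected):
        """Record a change of connection status and flush the queue."""
        self.connected = connected
        return self.drain()
```

The application calls submit in the same way whether or not the link is up, and the queue is emptied when the connection returns. In a reliable system the queue and the cache would be kept in persistent storage rather than in memory.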

We have proposed a model based on a simple and general proxy with access to a cache and an agent. We have shown how a browser application built on top of this solution can be used to make a thin client system that can handle disconnections. We have analyzed this model in detail.

9.4 Implications of reliability on a client-server system

We want to guarantee delivery of requests and responses in case of transient, recoverable faults. We have done this by using a fault model structured in two parts:

- Modules that can recover based on persistent storage.
- Transactional protocols for communicating between modules.

We have split our system into 3 containment modules: a communication layer, an application support layer and an application layer. Our system only supports recovery of the first two layers. However, our system offers support for the application's recovery of pending requests. We have added watchdogs in order to restart modules that break down, but the modules are responsible for recovering themselves after restart.

9.5 A hospital application using a disconnected system

A hospital application may be affected by disconnections. Disconnections introduce communication delays. If we can not accept these delays, we must allow parallel execution and later merging. Executing activities in parallel requires that they are later matched to the task they are a part of. The matching can be performed by transporting a unique ID outside the system (for example by agreeing orally) or by finding semantic dependencies between the task data and creating advanced merging procedures. We want to support both methods.

Chapter 10
Design

In this chapter, we present the design considerations of a client-server support middleware that can be executed on hand held computers, handles disconnections, and offers reliable communication. We start with a general presentation of our design. Thereafter, we describe the design considerations relevant for each layer.

10.1 General overview of the design

Figure 10.1 presents an overview of the design of our system. The system is composed of three layers. Each layer has a part placed on the hand held computer and a part placed on the fixed network. Additionally, we include watchdogs on both sides. The individual modules can be regarded as independent processes, threads or libraries. However, on some hand held computers (like the Palm platform), they can only be either libraries or threads because multiprocessing is not supported. The system modules are presented here:

Communication layer offers asynchronous communication primitives and guarantees reliable message delivery. The communication layer can mask communication link failures. The layer is based on a message oriented middleware (MOM). The layer can be fully recovered from state stored on stable storage. The communication layer is split between the hand held computer and the fixed network. The MOM can use any reliable transport protocol for sending messages between the mobile part and the fixed network part. We present this layer in detail in a later section.

Figure 10.1: Design of the system on client and server side

Application support layer offers an internet type client-server API designed for disconnected operation. The application support layer is mainly responsible for transforming the requests and responses into messages and back. The layer is composed of a proxy on the hand held computer and an agent on the fixed network. The agent makes synchronous requests against servers on behalf of the client. The proxy receives asynchronous requests from the client application and uses the communication layer to communicate asynchronously with the agent. The proxy is very primitive and includes a cache that is used to service requests directly. Both proxy and agent can fully recover their state using data stored in persistent storage. The application support layer guarantees delivery of requests and responses. The layer is presented in detail in a later section.

Application layer is divided into two parts: a general XML-browser application and an application that supports the hospital work process. The general XML-browser application consists of two parts:

1. A browser application that executes on the hand held computer. The browser can show forms that comply with an XML-encoding specification.

2. The form generator, placed on the fixed network. The server uses the form generator to produce the forms for the browser.

The browser offers a special page where pending requests are listed in a table. The browser shows the connection status and whether there are messages in the communication layer. The communication layer and the application layer offer this information through a special API (application level awareness of disconnections).

On top of the browser, we have a dedicated application layer that supports the work process in a hospital. This application layer is a server that interfaces with the existing hospital system of Radiometer Medical A/S. The application layer is presented in detail in section 10.6.

Watchdogs are present on the hand held computer and on the fixed network. The watchdogs watch the three layers and ensure that failed processes (or modules) are restarted. The restarted processes recover themselves. The watchdogs include a primary process that watches the processes in the layers, and a secondary watchdog process that supervises the primary watchdog (process-pair design).

In the analysis we mentioned several reasons for dividing the design into layers. The major ones are:

1. Ease of design. The design is easier to comprehend and test when it is split into smaller parts.

2. Reliability. By dividing the design into layers we can use containment techniques (presented in section B.2.1). Internal faults in the modules can be handled by the modules themselves. By clearly defining responsibilities, handling faults becomes easier. Likewise, it becomes easier to define the state needed for recovery and to let the modules recover themselves (upon restart).

3. Flexibility. The first two layers, the communication layer and the application support layer, are of general use. These two layers make up the client-server middleware for disconnected work.
The communication layer is basically an implementation of a MOM system. This system can be used directly for building reliable client-server applications on hand held computers. The application support layer offers a general client-server API with asynchronous invocations and support for disconnected operation. The layer can be used for building client-server applications for environments with periodic disconnections. The off-line browser application that we built to support the hospital work process is a general thin client application. Other applications can be built by using other server applications. In addition, our design tolerates new implementations of each layer as long as the basic interface is met. It is easy to introduce a new protocol specific proxy and agent that will serve new applications.

10.2 Communication layer design

The communication layer is responsible for the exchange of messages between the hand held computer and the fixed network. In the design we have focused on several aspects:

1. Flexibility. We address the openness of the design and present the different ways of using the communication layer.
2. Addressing. In order to support the flexibility, we present how the different queue clients are addressed.
3. Automation. We present the design choices that support automatic message transportation.
4. Reliability. We present how the communication layer ensures reliable delivery of messages.

Flexibility of communication layer

When designing the communication layer (using MOM), we have chosen to support all three kinds of communication presented in figure 10.2. We have chosen the JMS specification as the interface to the MOM (queue). JMS can be used in all three kinds of communication. Usually, one queue is used to support communication between two applications (as in example (a) or (b)). Because of disconnections, we need to split the queue so that messages are stored locally and transferred to the central queue using inter-queue transportation (as in example (c)).

Figure 10.2: Design of the communication layer using MOM, with the JMS interface. Two applications communicating using: (a) one queue on the same computer, (b) one queue lying on a remote computer, (c) queues lying on their own computers.

The inter-queue transportation (the arrows in example (c)) can either use the JMS interface or a specific protocol between the queues. The JMS interface allows sending and receiving one message at a time. During disconnection, a list of messages to be transported can accumulate, so we would have to send more than one message at a time. Receiving messages happens by issuing a receive request on the queue. With messages waiting on both queues, exchanging them one at a time through the JMS interface would create a lot of traffic. Therefore, we have chosen a specific inter-queue protocol that communicates all the messages between the queues.

Addressing messages

The messages in the queue are identified by source and destination fields. We have chosen to identify objects by URLs. Every message is stored in the queue for a specific URL. URLs give flexibility in addressing and can address different applications on one or many computers. The URL includes information about the protocol and the computer. For example, an application called browser, using the http protocol and running on handheld1, is addressed by a URL that names the protocol, the computer and the application, while an application called app running on the same computer and using the same protocol is addressed by a URL that differs only in the application name. The same application can have different queues for different protocols and different port numbers (the port number is also part of the URL).

Messages whose destination computer ID differs from that of the local queue are regarded as outgoing messages and put in the outbox. When contacting the global queue, the local queue transport process is responsible for transporting all messages to the global queue and from the global queue to the local computer.

Because the communication layer services the application support layer, which uses a unique message ID to see whether data has arrived, the queue was expanded with the possibility of asking for the messages with a certain ID. This interface to the application support layer improves the polling efficiency.

Automation of message transportation

Figure 10.3 presents the extension of the communication layer that makes it possible to transport messages automatically. The CommManager is a process that performs the synchronization of queues in the background (for example every second). The inter-queue transportation is initiated by one queue, while the other queue waits for requests.

Figure 10.3: Design of the communication layer, that automatically transports messages between queues.

The waiting queue accepts two kinds of requests:

1. Queue-to-queue requests, which are used to synchronize the messages between the queues.
2. Peer-to-queue requests, which are used to support JMS communication, i.e. sending and receiving single messages on a queue.

The CommManager connects the persistent queue and the low level network connection. The network and communication layer can easily be exchanged if we want better network support (for example a special wireless protocol). When a local CommManager contacts a global queue (it can also contact other queues), they use the following algorithm to exchange data:

1. The client queue contacts the global queue by sending either an empty request (EMPTY with a queue name) or a message waiting in the outbox (see footnote 1).
2. The global queue reads the sender's ID and checks whether there are messages waiting for this ID. The sender's ID identifies its inbox on the global queue.
3. If there are messages for this ID, the global queue sends a message; otherwise it sends an EMPTY.
4. If the local queue receives EMPTY and has itself sent an EMPTY, there are no more messages to be exchanged and the synchronization is finished. Otherwise go to step 1.

Footnote 1: Messages are put in the inboxes of the local applications, and in the outbox if the address is not local (an address is local if it uses the same IP number).
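The four-step exchange above can be sketched as follows. This is a minimal in-memory illustration, not the thesis implementation: the class names GlobalQueue and LocalQueue, the exchange and synchronize methods, and the EMPTY sentinel object are assumptions made for the example, and a direct method call stands in for the network round trip.

```python
from collections import defaultdict, deque

# Sentinel standing in for the EMPTY request/reply of the protocol (assumption).
EMPTY = object()


class GlobalQueue:
    """Hypothetical stand-in for the queue on the fixed network."""

    def __init__(self):
        # One inbox per sender ID, holding messages waiting for that client.
        self.inboxes = defaultdict(deque)
        # Messages accepted from clients, in arrival order.
        self.received = []

    def exchange(self, sender_id, message):
        """Steps 2 and 3: accept one message (or EMPTY), answer with one waiting message or EMPTY."""
        if message is not EMPTY:
            self.received.append(message)
        inbox = self.inboxes[sender_id]
        return inbox.popleft() if inbox else EMPTY


class LocalQueue:
    """Hypothetical stand-in for the queue on the hand held computer."""

    def __init__(self, queue_id):
        self.queue_id = queue_id
        self.outbox = deque()
        self.inbox = []

    def synchronize(self, global_queue):
        """Run the exchange until both sides have sent EMPTY; return the number of round trips."""
        rounds = 0
        while True:
            # Step 1: send the next outgoing message, or EMPTY if the outbox is drained.
            sent = self.outbox[0] if self.outbox else EMPTY
            reply = global_queue.exchange(self.queue_id, sent)
            rounds += 1
            if sent is not EMPTY:
                # Drop the message only after the exchange has returned.
                self.outbox.popleft()
            if reply is not EMPTY:
                self.inbox.append(reply)
            # Step 4: finished when an EMPTY was sent and an EMPTY came back.
            if sent is EMPTY and reply is EMPTY:
                return rounds
```

With two messages in the outbox and three waiting on the global queue, the loop needs four round trips: two that carry messages in both directions, one that only fetches, and a final EMPTY/EMPTY exchange that ends the synchronization. A disconnection during the loop is not modeled here; in the design it simply means that the CommManager retries later.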

Disconnections can happen during the queue synchronization. In that case, the CommManager has to try to synchronize the queues at a later time. By using a reliable protocol, the messages are transported in a reliable way.

Reliability of communication layer

From the view of the application support layer, the communication layer is a reliable communication channel. The communication layer guarantees delivery of messages. To do this, the messages are kept in persistent local storage (a persistent database). We use the store and forward method, supported by a reliable protocol. The reliable protocol (TCP/IP) is used on top of an unreliable connection such as a wireless link. Using the reliable protocol, we can be sure that data were transported successfully. We assume that data, once accepted, can be saved in persistent storage. After successful transportation, the data are saved in persistent memory on the recipient side. Only then are the data deleted from the sender's persistent memory. Messages kept in persistent memory survive soft errors. Using recovery techniques, we can return to a stable state of the communication layer.

10.3 Summary - Communication layer design

The communication layer is designed around a MOM. The MOM was expanded with store and forward functionality to ensure reliable communication. The implementation of the MOM was split into two parts: the hand held part and the fixed network part. The two parts communicate using inter-queue communication methods (all messages are synchronized in a series). They also accept client calls through the JMS interface. The addressing of queues is based on URLs. Using URLs we can address applications on the same computer as well as applications running on different computers. Messages can be transported automatically in the background by a synchronization process. The communication process initiates the inter-queue communication and tries to transfer as many messages as possible. A reliable protocol is used in combination with the store and forward method to ensure the delivery of messages between queues. To sustain crashes, messages are kept in persistent memory.
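The store and forward rule summarized above (save on the recipient side first, delete at the sender only afterwards) can be sketched as below. This is a hedged illustration under stated assumptions: SQLite stands in for the persistent database, the names PersistentQueue and forward_all and the link_up callback are hypothetical, and the duplicate-tolerant insert is an addition of this sketch rather than something the design prescribes.

```python
import sqlite3


class PersistentQueue:
    """Hypothetical message store; SQLite stands in for the persistent database."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        with self.db:
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS messages ("
                "id TEXT PRIMARY KEY, destination TEXT, body TEXT)"
            )

    def store(self, msg_id, destination, body):
        # INSERT OR IGNORE makes a retransmitted message harmless (assumption of this sketch).
        with self.db:
            self.db.execute(
                "INSERT OR IGNORE INTO messages VALUES (?, ?, ?)",
                (msg_id, destination, body),
            )

    def delete(self, msg_id):
        with self.db:
            self.db.execute("DELETE FROM messages WHERE id = ?", (msg_id,))

    def pending(self):
        cursor = self.db.execute(
            "SELECT id, destination, body FROM messages ORDER BY rowid"
        )
        return cursor.fetchall()


def forward_all(sender, recipient, link_up=lambda: True):
    """Move as many messages as the link allows; return how many were moved."""
    moved = 0
    for msg_id, destination, body in sender.pending():
        if not link_up():
            break  # disconnected: the rest stays in the sender's store for a later attempt
        recipient.store(msg_id, destination, body)  # 1. persist on the recipient side
        sender.delete(msg_id)                       # 2. only then delete at the sender
        moved += 1
    return moved
```

If the link drops between the two steps, the message exists on both sides and is sent again on the next attempt, which is why the sketch makes the store operation tolerant of duplicates.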

10.4 Application support layer design

Figure 10.4: Application support layer

As shown in figure 10.4, the application support layer connects the application with the server. The application support layer is a service responsible for delivering requests and replies using the underlying communication layer. It services the applications by performing their requests during both connection and disconnection. We present the design choices that support this functionality:

1. Connecting the application with the communication layer. The application support layer is mainly responsible for connecting the application with the communication layer, which means transforming requests and responses into messages and back. When an application makes a request, the application support layer transforms it into a message. The message is transported using the communication layer (described in section 10.2) and then fetched by an agent (see footnote 2). The agent unpacks the message and executes the request against the server. The response from the server is then packed into a message and sent back to the client. The application support layer on the mobile computer unpacks the message and returns the response to the application.

2. Having an open design. One part of the application support layer executes within the application; the other part is the proxy, which is a background process. We have chosen an open design where many applications can use the same proxy. We present below how the application support layer is split.

3. Managing requests during connection and disconnection. The application support layer manages requests and services them during connection and disconnection. The proxy can service requests directly from the cache. It then acts as a server for the application. The extension of the proxy with a cache is presented below.

4. Performing requests to the server and collecting replies. The on-line part of the application support layer executes requests on behalf of the client application. The extension of the server side that allows executing client requests is presented below.

Footnote 2: Agents are discussed in section 6.4.

Connection to communication layer

When making a new request, the application support layer creates a connection between the application and the communication layer. To provide the functionality mentioned above, the application support layer must meet several requirements:

1. Connecting requests/replies with messages. All the processing that takes place in the communication layer is managed by the proxy. In order to connect requests/replies with messages, the proxy has to:
   - Translate a request into one or more messages. A request is asynchronous; it is transformed into a message that is placed in the queue.
   - Reconstruct the reply from the returned message(s). When the answer message arrives, the application support layer reconstructs the answer from the message and returns the data to the application as a stream.

2. Handling many requests. This applies both to many requests from the same application and to many applications making requests.

3. Informing about delivery. Optionally, the application support layer may be expanded with a notification facility. This facility may be important for applications that report delivery status.

4. Persistence. Because the application believes that a request is delivered as soon as it is accepted by the application support layer, the system should be persistent. This means that after an error in the application or in the application support layer, the status must be recreated.

Below we present the design that solves the mentioned issues.

Connecting requests/replies with messages

When a request is created, it is transformed into a message. The message is then sent, and the returning answer message is transformed into a reply. The message includes the sender and recipient addresses. Additionally, the message can include policies such as time to live, or control information used by the application support processes: proxy and agent (their functionality is presented later in this chapter). To deliver messages to the communication layer, we simply use the communication layer's asynchronous send method, and the communication layer takes care of delivery. After the communication layer has transported the request message to the server side, it transports back the reply, which is packed into a message.

The reply message is fetched using a polling technique. The polling is initiated by the client application, which has access to an asynchronous poll primitive that is part of the application support layer. The poll returns the status of the request, which is either DATA ARRIVED, NOT READY, or SERVER UNACCESSIBLE. When data has arrived, the client can read it using a stream. If not, it can poll again at a later time. The poll method is performed on a connection object that is kept alive by the client application.
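The poll primitive described above can be illustrated with a small sketch. The three status values come from the text; everything else is an assumption made for the example: the Connection class, its poll and open_stream methods, and the two dictionaries that stand in for the proxy's bookkeeping of arrived replies and failed requests.

```python
import io
from enum import Enum


class PollStatus(Enum):
    DATA_ARRIVED = "DATA ARRIVED"
    NOT_READY = "NOT READY"
    SERVER_UNACCESSIBLE = "SERVER UNACCESSIBLE"


class Connection:
    """Hypothetical connection object kept alive by the client application."""

    def __init__(self, request_id, replies, failed):
        self.request_id = request_id
        self._replies = replies  # request ID -> reply bytes (stand-in for proxy state)
        self._failed = failed    # request IDs the agent could not deliver to the server

    def poll(self):
        """Return the current status of the request without blocking."""
        if self.request_id in self._failed:
            return PollStatus.SERVER_UNACCESSIBLE
        if self.request_id in self._replies:
            return PollStatus.DATA_ARRIVED
        return PollStatus.NOT_READY

    def open_stream(self):
        """Give the client the reply as a readable stream once data has arrived."""
        if self.poll() is not PollStatus.DATA_ARRIVED:
            raise RuntimeError("no reply available yet; poll again later")
        return io.BytesIO(self._replies[self.request_id])
```

A client would call poll() from its own event loop or on user request, read the stream when the status is DATA ARRIVED, and otherwise try again later, which matches the re-request behavior of the browser.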
Handling many requests

To support many requests at a time, we make it possible to open many connections to the application support layer. Every connection then needs to be identified. To be able to create unique identifiers, we need a pool of unique IDs. This can be solved by having a centralized ID creator, which is used by every socket and every application.
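A centralized ID creator of this kind could look like the sketch below. The names IdCreator and PendingRequests are hypothetical, and the in-memory counter is a simplification: a real creator would have to keep its counter in persistent storage so that IDs stay unique across restarts.

```python
import itertools
import threading


class IdCreator:
    """Hypothetical centralized ID creator shared by all connections."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._lock = threading.Lock()

    def next_id(self):
        # The lock keeps IDs unique when several threads open connections at once.
        with self._lock:
            return next(self._counter)


class PendingRequests:
    """Hypothetical table of open connections, keyed by their unique ID."""

    def __init__(self, id_creator):
        self._ids = id_creator
        self._waiting = {}

    def register(self, request):
        """Assign a fresh ID to a request and remember it."""
        request_id = self._ids.next_id()
        self._waiting[request_id] = request
        return request_id

    def match(self, reply_id):
        """Return and forget the request that a reply with this ID belongs to, or None."""
        return self._waiting.pop(reply_id, None)
```

Because every application and every connection draws from the same creator, two requests never share an ID, so a reply carrying an ID can be attributed to exactly one open connection.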

The unique ID is used to identify messages that are sent. When a reply message is constructed, the same ID is used. Using this ID, the reply message can easily be matched with the waiting request. The unique ID is only needed on the application level, where many requests must be distinguished, since the URL addressing (presented before) is sufficient for routing messages between computers and applications.

Persistence

To support this requirement, we have decided to save the application support layer's status in persistent storage. The table of open sockets and their data is saved in a database. After a breakdown it is recovered. The information saved for the application support layer is:

request data: On the hand held computer side, the request data are saved until they are delivered to the communication layer. On the fixed network side, the request is saved until the server has serviced it.
correlation ID: Saved until the application fetches the data.
response data: On the fixed network side, the response data are saved until they are delivered to the communication layer. On the hand held computer side, they are saved until the application fetches them.

Separating application interface from the request reply manager

In a multi-process environment, the application support layer may be a separate proxy process that filters all incoming and outgoing communication. Communication with a separate process cannot go through shared memory but must use network communication. On a thin client like Palm OS, we cannot run many processes at one time. Nevertheless, we have separated the application side from the request engine (request reply manager) to make the design more open. Our application support layer is split into:

Figure 10.5: Dividing the application support layer into application side API and request engine.

1. Application side API. It consists of the URL and HttpConnection classes. These two classes are used to establish a connection, therefore they always run on the application side.
2. The proxy process. It includes the proxy and the cache. The proxy is a client of the local queue (footnote 3: communication layer implementation). The HttpConnection object is a client of the proxy.

These two sides communicate by exchanging messages. The application side API acts as a client of the proxy. The request engine services the application side's requests. It can service many applications, because every application identifies itself when making a request. We simulate the socket connection to the request engine by calling the talk method of the proxy. The method accepts requests and sends replies using a special protocol.

Servicing requests during disconnection using cache

The client side was expanded with additional functionality to service requests during disconnections: the application support layer was expanded with a cache. Having a copy of the working data in local memory, we can be independent of the server. In that case, the application support layer can act as a server for the application. Whenever the application makes a request, the proxy checks whether the request can be serviced from the cache. Servicing requests from the cache includes the following functions:

1. Finding data in the cache. We use the URL and the request method to identify requests. Because the parameters of a POST request are not included in the URL, we use a cache key that is the hash value of the URL and the POST request data.
2. Managing outdated cache entries. Every time a client issues a request, we look in the cache and check whether a cached object's hashed URL is the one we are looking for. At the same time we check the TTL of the cached objects. When they are too old, they are removed from the cache.
3. Placing the cache. We have chosen a centralized cache where many applications store data. In this way the redundancy of data is limited.
4. Deciding which data should not be cached. When a reply to a request arrives, we may not want to store it in the cache. For example, when requesting a counter it makes no sense to store the previous value of the counter, even if the request is the same. Because the application support layer cannot know which data can be cached, we have chosen to let the application decide whether the data should be cached.
5. Supporting data updates. An application can use a PUT request in two ways: against a simple data server and against a complex server. The application states whether it is a data server or a process server. When a simple data server is used, we use hash(url) to identify the cache entry. Then we update the cache and send the update to the application. When a complex server is used, we use the whole request to identify the cache entry. Additionally, we need a data update process (a copy of the server logic) that performs the update and informs the server about how the update was performed.

Application agent

The fixed network part of the application support layer is an agent. The agent executes requests on behalf of the client application. The agent can vary from a simple request-performing engine to a more sophisticated one, where a set of requests is performed in a row. Additionally, an agent can perform distillation of responses to adapt them to different clients. The functionality of a simple agent can be described as follows:

1. Fetch the message from the queue. The messages are fetched by polling the queue.
2. Unpack the request. While unpacking the request, we identify the ID that was set by the proxy and save it for later.
3. Open a connection to the server. Using the destination data in the request, we try to connect to the server. This action has two possible outcomes: 1) the connection is successful, or 2) the server cannot be contacted (because of some error). If the connection was successful we continue; otherwise we either create an error message or retry, depending on the client request type.
4. Perform the request. Here we just perform the request and wait for the answer. In case of an error, a proper error message can be created.
5. Accept the response. When the server sends the response, the response is saved in persistent memory (in order to survive a system crash).
6. Close the connection to the server (optional). Depending on the protocol, we can close the connection or keep it open for later requests.
7. Perform preprocessing (optional). The response from the server can be processed using distillation processes.
8. Pack the response into a message. The response is packed into a message, which is marked with the message ID that was saved while unpacking the request.
9. Send the message using the communication layer. The send is performed using the asynchronous send primitive of the communication layer.

In order to support persistence and the delivery guarantee of requests, we have to save the state in two places:

1. When receiving the message from the communication layer. At that point the communication layer is no longer responsible for persistence. The persistence control is taken over by the agent, which saves the message in persistent memory.

2. When receiving the response from the server. After the server has executed the request, the application support layer must guarantee the delivery of the response to the client. Consequently, the response is saved in persistent memory.

10.5 Summary - Application support layer design

The application support layer supports applications by giving them easy access to the communication layer. The access is given through a request/reply API. Additionally, the application support layer supports disconnection by incorporating a proxy that services requests from the cache. To execute requests remotely, the application support layer incorporates agents that execute requests on the fixed network side. The application support layer also offers an API that makes it possible for the application to recover its pending requests after a client failure (or exit). The application support layer consists of:

Application API is a standard URL and HttpConnection, expanded with an API that controls the behavior of the proxy and the agent. The control of the proxy includes control of the cache behavior. The control of agents includes control of the agent's action on certain responses. For instance, the agent may be instructed not to return a reply to the client before the server responds.

Agent accepts application requests, transports them to the server, fetches the response, and returns the response to the application. The underlying communication system is not visible to the application.

Proxy is responsible for maintaining connections. This means opening and closing connections, managing many requests, transporting messages and informing the application about request status. Every request is translated into a message that is sent with the communication layer. To support many requests, the messages are identified by a unique correlation ID. The proxy is general, but it uses a protocol specific agent. It chooses the agent based on the protocol used.

Cache is the extension of the hand held part of the application support layer. The cache is used to service requests locally. It is a general cache that can be used for any internet type request.

Agent is the fixed network part of the application support layer that executes requests on the client application's behalf. It acts as a client of the server. The agent, in contrast to the proxy, is protocol specific. It may be expanded with application logic that handles more complicated requests or keeps sessions alive while clients are disconnected.

10.6 Application layer design

The application layer is responsible for:

1. Interacting with users. We have chosen a form browser application. We present the forms offered to users and the actions performed when interacting with them.
2. Managing underlying layers. The browser type forms are just pages with links or forms that are transported using the underlying layers. We present how the browser manages the layers.
3. Controlling the application logic (server side). We present the server design: how the responsibilities are distributed and how the business process is maintained.

Interacting with user

The interaction with the user includes two aspects:

1. Browser behavior: the design of how the browser reacts to user actions.
2. Form design and actions: the design of the forms and a presentation of their behavior.

Browser behavior

The off-line browser differs from a standard browser in its awareness of disconnection. An off-line browser includes additional functionality that supports disconnection. In a standard browser, the user interaction is reduced to filling out forms, issuing requests, and changing pages. Basically the browser shows only one page, and other pages are accessed by new requests. We have expanded the browser with the well known history option, where the user can go back to the previous page without issuing a request. This is done by clicking the Back button. To support disconnection awareness, the browser was expanded with an overview page where all unserviced requests are shown. From that page the user can check whether they have been serviced (whether the data can be fetched). The user can access new pages in two ways: 1) by clicking on a link and 2) by submitting a form. Both of these actions involve the application support layer and are discussed below.

Forms

We have described the browsed pages using an XML based syntax. The syntax is highly inspired by HTML. The widgets include text boxes, links and forms. The forms may include hidden parameters, text input, radio buttons, checkboxes and push buttons. Expanding the XML syntax with new forms and widgets should not be complex (see footnote 4). Forms are submitted by clicking a submit button. On submission, all the parameters are collected and a URL including the parameters is created.

Managing underlying layers

The browser controls the behavior of the underlying layers. Three options are controlled:

1. Request types. Link requests are GET requests. Submitting a form can be either a PUT or a POST request. The form parameter method defines the request type.
2. Cache control. When submitting a request, the application decides the policies for the cache. We have chosen the following client policy for requests: link requests are GET requests whose responses are cached, in order to support disconnection. Submit requests all bypass the cache, because many of them submit data to the server and thereby change the server state.

Footnote 4: In appendix A.1, we present an example of an XML form.

3. Agent control. We have chosen to use a simple agent that only executes synchronous requests against the server. Additionally, the agent can be instructed to wait for the server response (it retries if the server is not available). The server decides whether the agent should wait for the server response or not. The information is included in the link or form parameter AGENTACTION.

Controlling application logic

Figure 10.6: Design of the server

The server is responsible for controlling the application flow and generating forms. As shown in figure 10.6, we have divided the server into application logic and an administrative module.

Administrative module is responsible for accepting requests of different types and translating them into a general type of request, which is sent to the server. The transformation is made by a protocol specific process which translates the request into the general form. Additionally, the administrative module is responsible for controlling access policies. Using the device DB, the administrative module receives the responses from the server and translates them into a device specific form.

Application logic accepts general requests and performs the actions. The server returns general responses. The server includes only the application logic. The server logic is shown in figure 10.7. The application logic includes:

156 10.7 SUMMARY - APPLICATION LAYER DESIGN 144 Figure 10.7: Design of application Logic (server side) 1. Merging procedure The merging procedure includes both the simple matching (having a task ID) and the more sophisticated matching based on data relations. 2. Task generator Task generator is the module responsible for assigning new tasks upon requests and generating responses. The response can either be new tasks or just acknowledgments on received tasks. 3. RIME cache Is used to store the data of not completed work processes. When the work process is finished the data are send back to the RIME 10.7 Summary - Application layer design The application logic consists of application on the off-line mobile computer, and of server(s) on the on-line side. The application design is based on a thin client model (section 5.1). In the thin client model, we reduce the code that is processed on the client and move its execution to the server. On the client side we only have a form browser (similar to a HTML browser), that displays forms and interact with users. Every form is generated by a server initiated by client requests. Every form is a text based XML description of a graphical widgets, that allow submitting data and making requests to a server. The form based approach opens a possibility to use any programming language. In contrast to forms, we could use serialized objects, but this feature is only supported in Java, and therefor not general. In a form, we specify some properties like: TTL - time to live (in how long time a form may be used, if it s cached)

Cache - whether the form may be cached. This incorporates the programming of the proxy from the application support layer.
Update type - whether it is an update of the data or of the process server.
RequestType - whether a request should be standing.

The design splits the responsibilities between the fixed network side and the hand held computer side. On the hand held computer side, the browser is responsible for:

1. Displaying the forms and interacting with the user
   Every response from the server is sent as a form. The form, which has a defined syntax, is parsed and displayed. The browser interprets user actions on the form, such as submitting the form, filling in a text box, or clicking on a link.
2. Generating requests and receiving responses
   Every form may include requests to other forms. The requests can either be links or form submit actions. The application executes requests on the application support layer and waits until they are serviced. Because of disconnections, it may take a long time before a request is serviced. The browser is therefore responsible for handling disconnections. Disconnections can be handled by giving the user the possibility to re-request. If there are many non-serviced requests, a list of them can be presented as a form page. The user can then see which of the outstanding requests have been serviced.
3. Navigating in forms
   Because the browser is often executed on a platform with a weak processor, parsing and displaying a form may be time consuming. To optimize this, we may cache objects in an application-specific way so that they are easy to access, for instance as a history object.

On the fixed network side, the server is responsible for:

1. Executing requests
   On receiving a request from a client, the server executes the code that results in a certain action, and then generates a client-specific reply.

2. Generating replies
   Every client request results in an answer, a form page. An answer can be a form, but can also be reduced to an information form. A server that generates new forms can control versioning and time to live.
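The TTL and Cache form properties summarized above can be illustrated with a small client-side cache. This is a minimal sketch under stated assumptions, not the implemented proxy: the class names CachedForm and FormCache, the millisecond units and the URL keys are hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder for one cached form and its TTL/Cache properties. */
final class CachedForm {
    final String xml;            // XML form description as received from the server
    final boolean cacheable;     // the "Cache" property of the form
    final long ttlMillis;        // the "TTL" property (assumed to be in milliseconds)
    final long storedAtMillis;   // time at which the form was stored

    CachedForm(String xml, boolean cacheable, long ttlMillis, long storedAtMillis) {
        this.xml = xml;
        this.cacheable = cacheable;
        this.ttlMillis = ttlMillis;
        this.storedAtMillis = storedAtMillis;
    }

    /** A form may be served from cache only if caching is allowed and the TTL has not run out. */
    boolean isFresh(long nowMillis) {
        return cacheable && (nowMillis - storedAtMillis) < ttlMillis;
    }
}

/** Hypothetical form cache keyed by request URL. */
final class FormCache {
    private final Map<String, CachedForm> byUrl = new HashMap<String, CachedForm>();

    /** Stores the form only when the server allowed caching. */
    void put(String url, CachedForm form) {
        if (form.cacheable) {
            byUrl.put(url, form);
        }
    }

    /** Returns the cached XML, or null when the form is absent or expired. */
    String lookup(String url, long nowMillis) {
        CachedForm form = byUrl.get(url);
        if (form == null) {
            return null;
        }
        if (!form.isFresh(nowMillis)) {
            byUrl.remove(url);   // expired forms are dropped
            return null;
        }
        return form.xml;
    }
}
```

Passing the current time as a parameter keeps the sketch independent of a system clock, which also makes the expiry rule easy to check.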

Chapter 11

Implementation

In this chapter, we describe how we have implemented the designed system and which parts of the design are actually implemented. The chapter also describes the difficulties met during the implementation. We present the implementation in the following steps:

1. Choice of programming language.
2. Unimplemented features.
3. System presentation.
4. Source code from others.
5. How to start the system.

11.1 Choice of programming model/language

We have chosen to implement the system in Java (JDK 1.2). On the hand held computer we have only used the CLDC subset of the Java 2 Platform, Micro Edition (a subset of Java (JDK 1.2) for hand held devices), because only the CLDC implementation of the KVM virtual machine is available for the Palm platform. We chose Java for several reasons:

1. Portability
   Our aim was to make a general system that could be used with any hand held device. Additionally, a lot of functionality was the same for the clients and the server, hence portability was highly desirable.

2. Popularity
   The second reason was the growing number of tools made for Java. When we started, there were several toolkits and application development systems that could be used to create applications for hand held devices. We had also heard that many hardware producers planned to integrate a JVM (Java Virtual Machine) in their new products.
3. Easiness
   The last reason was the ease of use of Java. On the Internet we could find tools and libraries for many purposes. Java also has a standard for application development for the Internet. We have used this interface, in a modified version, for our client-server system. We have also found a standard for messaging systems built with Java (JMS). We have implemented a subset of this specification and used it in our communication layer.

The modules that we have implemented are easily portable (most of them should function without modifying the source code). Only some parts that are very device specific (like the Palm GUI) need to be changed. These were, however, implemented in separate classes.

11.2 Unimplemented functions

The implementation of our system is not complete. Because of time limits, we have focused on implementing the features that were necessary to present our system. The implementation differs from the design in the following ways:

1. We have not implemented the watchdogs (neither on the client nor on the server).
   The system of watchdogs was not implemented because we have not observed system hangs. To make a truly reliable system, this functionality should be added.
2. We have not implemented a persistent queue on the server side.
   We believe that implementing the persistent queue would change neither the overall system functionality nor the performance. Our focus was on the client side; hence the client application is reliable and recoverable at the module level.
3. We have not implemented the write functionality to the cache.

4. We have not implemented the administrative module on the server.

11.3 System presentation

We present the system in the following sections:

1. Hand held side presentation.
2. Communication layer on the fixed network side.
3. Application support layer on the fixed network side.
4. Application layer on the fixed network side.

Hand held side

As presented in the design, the system is divided into a client side and a server side. On each side we distinguish the system layers. In the presentation, we only show the major classes and their responsibilities.

Figure 11.1: Application on the client side

As shown in figure 11.1, the system was implemented as presented in the design. Every object has a different functionality. In figure 11.2, we see the relations between the objects and the threads in the system.

Figure 11.2: Object and thread relationships. The container object includes the other objects and threads. XMLApp is the main thread controlling the others.

Container application (OfflineBrowser.ContainterApplication.java) is the object that initiates all the other objects. It keeps the references to the objects so that they can communicate with each other. In the future, it should also include a watchdog that maintains the references and initiates the system modules.

Browser (xmlapp.xmlapp) is the implementation of the browser. XMLApp has the following methods:

  pendown - intercepts events and calls other methods.
  getpage - code responsible for getting new pages. getpage starts a xmlapp.pollingthread that performs the asynchronous requests to the application support layer. PollingThread uses URL and HttpConnection, and it communicates with appsupport.requestmgr using talk. When the response arrives, the PollingThread parses the page into an AppBrowser object.
  listrequests - fetches the list of standing requests. It parses the returned stream and generates a list of URLs.

ReqReplyMgr (appsupport.requestmanager) is responsible for maintaining the requests and replies and for connecting the messages. RequestManager includes the following methods:

  talk - the method used to communicate with the HttpConnection class. talk accepts the following requests: OPEN, CLOSE, REQUEST, POLL and LIST, and calls the appropriate service method for each.
  polljms - a private method that contacts the queue and checks whether any messages have arrived.
  request - services the requests. First, it creates an openconnection field. Then it checks whether the request can be serviced from the cache (appsupport.cache). Then it issues a queue.send request.

Queue (palm.queue) is the message container, where all the incoming and outgoing messages are stored. Queue offers the following methods:

  send - puts the message in the persistent DB.
  receive - looks for messages in the DB.
  delete - deletes the message from the DB.

Queue uses a persistent Vector (palm.vector) and the DB (palm.DBClass).

CommMgr (palm.communicationmgr) is the object that encapsulates the functionality of managing the queue while transporting messages. CommMgr has the following methods:

  ClientRun - the method that runs the message exchange process.
  sendpackage/receivepackage - sends/receives a simple package, which is either an EMPTY package or a DATA package.
  sendmessage/receivemessage - sends/receives a message encapsulated in a DATA package.

CommMgr is executed by a Synchronizer thread (palm.synchronizer), which opens the connection and runs CommMgr.ClientRun.

Used APIs:

URL - just an object that opens a connection to HttpConnection.

HttpConnection - interface to the off-line HttpConnection that uses RequestManager. HttpConnection includes the following methods:

  openconnection(url) - opens a new connection.
  openconnection(urlrecovery) - opens a connection to a pending request. This method is used to get the waiting requests that were not serviced because of disconnection.
  setcachepolicy - sets the policy for the local proxy (CACHE/DONTCACHE).
  setproxypolicy - sets the policy for the agent (WAITFORRESPONSE/RETURNIMMEDIETELY).
  setapplicationid - is used to identify the application.
  getrequeststatus - returns the status of message delivery (READY, NOTREADY or ERROR).
  getinputstream - returns the transported data.

RRMApi - an API that gives access to the non-serviced requests. The RequestManager accepts a (LIST + appid) request. In return, talk returns a stream that includes the XML description of the pending requests. Example:

    // Build the LIST request for this application and wrap it in a stream
    String req = "LIST " + applicationid;
    java.io.InputStream is = new java.io.ByteArrayInputStream(req.getBytes());
    // talk returns a stream with the XML description of the pending requests
    java.io.InputStream retstream = this.owner.requestmanager.talk(is);

  The method xmlapp.xmlapp.listrequests() parses the returned stream and gives a list of pending requests.

AppAPI - an interface to the queue that informs whether there are any messages for a given application. This API was not implemented.

JMS - a queue interface that allows sending and receiving messages. The JMS interface is presented in A.6.3.

CommMgrAPI - an extension of JMS that gives CommMgr direct access to the queue. Instead of using JMS, the CommMgr manipulates the Queue directly by using the send, receive and delete methods.

AdaptAPI - an API for applications that want to check whether we are on-line or off-line. Using this API, an application can force a connection.

Socket - the interface to the underlying sockets (or other protocols).
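The send, receive and delete operations that CommMgr uses on the Queue can be sketched as follows. This is only an illustration under stated assumptions: the class SketchQueue, its method signatures and the string message IDs are hypothetical, and the sketch keeps messages in memory, whereas the implemented Queue stores them in a persistent Palm DB.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory stand-in for the message queue (hypothetical signatures, no persistence). */
final class SketchQueue {
    private final List<String> ids = new ArrayList<String>();
    private final List<String> payloads = new ArrayList<String>();

    /** send: stores the message; it stays in the queue until it is explicitly deleted. */
    void send(String id, String payload) {
        ids.add(id);
        payloads.add(payload);
    }

    /** receive: returns the ID of the oldest stored message without removing it, or null if empty. */
    String receive() {
        return ids.isEmpty() ? null : ids.get(0);
    }

    /** Returns the payload stored under the given ID, or null if the ID is unknown. */
    String payloadOf(String id) {
        int index = ids.indexOf(id);
        return index < 0 ? null : payloads.get(index);
    }

    /** delete: removes the message, for example after the peer has acknowledged it. */
    boolean delete(String id) {
        int index = ids.indexOf(id);
        if (index < 0) {
            return false;
        }
        ids.remove(index);
        payloads.remove(index);
        return true;
    }

    int size() {
        return ids.size();
    }
}
```

Keeping receive and delete as separate operations means that a message which was read but never acknowledged is still in the queue after a broken connection.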

Communication layer on the fixed network side

Figure 11.3: The communication layer's objects and threads on the server side. QueueServer creates the Queue and two threads that wait for requests on the socket. The ServerQQ thread communicates with a queue on a hand held device, and the ServerRQ thread communicates with remote clients.

The communication layer on the server side is the implementation of MOM. As shown in figure 11.3, on the server side we have a QueueServer object that initiates two threads that wait for client requests: ServerRQ and ServerQQ. When a request from a hand held computer arrives, ServerQQ starts a process that handles the communication with the hand held queue. In the same way, ServerRQ handles the requests from queue clients, for example agents. All the threads share one queue object, where all the messages are stored. The CommMgr (pc.communicationmgr) on the central queue offers the following functions:

  serverrun - is used to communicate with the queue on the hand held computer. This method is called from the ServerQQProcess.
  serverRQRun - is used to communicate with a remote queue client. serverRQRun is executed from the ServerRQProcess.

The CommMgr uses a socket to transport data between the QueueServer and the clients.

Application support on the fixed network side

The application support on the server side is a set of protocol-specific agents. We have implemented an agent called HttpProxy that issues HTTP requests to the server. The HttpAgent (apps.httpagent) queries the central queue. When a message arrives, it starts an HttpAgentProcess (apps.httpagentprocess) to handle the request. The HttpAgentProcess opens a socket connection to the server.

API

The application support is accessed through the standard Java API for opening network connections. The URL is used to specify the destination. The network connection is decomposed into 3 modules:

1. Protocol handler
   The specification of the protocol.
2. Content handler
3. Connection

Application on the fixed network side

As shown in figure 11.4, the server application is implemented using the following objects:

GatewayServer (gateway.gatewayserver) is the object that initiates all the needed parts. It keeps the references to the objects.

RimePortal (gateway.rimeportal) is the object responsible for performing CORBA calls to Rime. This object was not fully implemented.

Figure 11.4: Server objects and threads

RimeCache (gateway.rimecache) is the main object containing the application logic. It has the following methods:

  setpatientresultdata (gateway.setpatientresultdata) - is called by the EventHandlerThread, which intercepts new patient events from Rime.
  setrequisitiondata (gateway.setrequisitiondata) - inserts a new requisition in the DB.
  setsampledata - inserts data from a nurse containing sample information.
  getpatientresultdata (gateway.getpatientresultdata) - searches for new patient results.
  getnextunservicedrequisition (gateway.getnextunservicedrequisition) - when a nurse asks about new requisitions, this method looks for them.

ServletRime (gateway.servletrime) is the servlet responsible for performing client requests. First it translates the request into ProtocolIndependentData. Then it asks RimeCache to perform the proper action. The result from RimeCache is used by ClientMgr to generate the response. Then the servlet sends the response back.

ServletForms (gateway.servletforms) is the servlet that returns the simple forms.

EventHandler (gateway.eventhandler) is the thread that handles events from Rime. It is started by the EventHandlerPoaImpl object, which is registered in Rime by RimePortal.

ProtocolIndependentData (gateway.protocolindependentdata) is the general data type. HTTP requests and CORBA events are transformed into this type.

ClientMgr (gateway.clientmgr) is the object responsible for generating the proper responses. ClientMgr uses the Forms object, where the forms are saved.

Forms (gateway.forms) is the object containing the application forms.

11.4 Source code from others

We have implemented most of the system ourselves. However, we have to some extent used source code from others.

We have modified the open source libraries for the Java internet-type client-server model (that is, the code for constructing HTTP requests in the client application) so that they can be used asynchronously. Among other things, we have added a test primitive called GetRequestStatus to the HttpConnection object, which we use to contact the proxy and poll for the response. We have also modified the library so that it only uses the CLDC subset of Java (JDK 1.2). On the hand held computer we have only used the CLDC subset of Java (JDK 1.2), because only the CLDC implementation of the KVM virtual machine is available for the Palm platform. We have implemented all other components of the system ourselves.

In our implementation we have used the following libraries:

1. IBM's implementation of kjava (a GUI library specified by Sun Microsystems Inc).

2. kxml version 0.8, an XML-parser library.
3. Jakarta-tomcat (servlet engine and XML parser on the fixed network side).
4. In addition, we used the XML database DBXML version 0.6 as a repository in our server.

11.5 System start up

Figure 11.5: The implemented system

In figure 11.5, we show the whole system. In order to start the whole system, we have to start the fixed network side and set up the Palm application. In the following we present how to:

1. Start the system on the fixed network, and
2. Set up the Palm computer.

Starting fixed network system

1. Starting the asynchronous MOM.
   Start the QueueServer (this will enable the )
2. Starting the HttpAgent.
   Set offlinejms.consts.queuename to be the same as the QueueServer IP.
3. Starting the application.
   To start the server application:

   (a) Start the XMLDB.
       XMLDB is the database used by the GatewayServer.
   (b) Start the RIME.
       The RIME server is needed to register the event interceptor in the GatewayServer.
   (c) Start the GatewayServer.
       In the GatewayServer, make sure that gateway.gatewayserver.servletformurl and ServletRimeURL are set correctly and point at the GatewayServer's IP.

Additionally, check that the AP can connect with the hand held devices and that the given IPs are accessible from the AP.

Starting palm application

Before starting the browser (XML Browser), start XMLPrefs. In XMLPrefs set:

  QueueName to the QueueServer IP
  AgentName to the HttpAgent IP
  LocalIP to the IP of the hand held device

Then the XMLApp can be started.

Chapter 12

Test

In this chapter, we present the test strategies for our system. The main test of our general client-server system is the hospital work-flow application, which demonstrates the whole system. However, we have made some additional tests. Because we want to present the system under different conditions, we split the test into sections in which different conditions are simulated. We structure this chapter in 4 parts:

1. General test setup
2. Performance test
3. Disconnection test
4. Reliability test

12.1 General test setup

Hand held computer

We used two SPT 1700 hand held computers from Symbol Technologies Inc. They were equipped with a Spectrum24 wireless network connection supporting data rates between 2 and 11 Mbps (Symbol Spectrum24 IEEE airwaves standard compliant/ieee b, S24 Driver Version 11M v3.00, S24 Firmware Version V ). The CPU was a Motorola DragonBall 68EZ328 (16 MHz). The device was equipped with 2 MB RAM and 2 MB ROM. The operating system was PalmOS. The device came with an integrated bar code scanner (Integrated SE 900/SE 900HS scan engine).

Wireless AP

The wireless AP was a Spectrum24 High Rate AP 4121 Access Point. The supported data rate was 11 Mbps (IEEE b).

Stationary computer

Standard PC with a 700 MHz CPU and 256 MB RAM. The operating system was Windows 2000 Professional.

Network

Standard 10 Mbps Ethernet.

Software

On the hand held computer, we used the Java Virtual Machine J9 version 1.4 from IBM. In addition, we used Rime from Radiometer Medical, release 7 ver 2.1. We used the XML database DBXML version 0.6 as a repository in our server, and Jakarta-tomcat as a servlet engine. When the hand held computers access the network through a serial connection (cradle), we use a null-modem and the RAS-server from Windows.

12.2 Performance Test

As shown in figure 12.1, servicing a request includes steps 1 to 16. First, the message is transported to the local Queue (steps 1 to 4), and then it is fetched by the CommunicationMgr, which transports it to the global queue (steps 5, 6). The HttpAgent fetches the message from the global queue (step 7) and executes a request against the server (step 8). Then the HttpAgent returns the server response to the global queue (step 10). The message is transported back to the local Queue (steps 11, 12) and fetched by the Request Manager (steps 13, 14). Finally, the message is transformed into data, which are received, parsed and displayed by the browser (steps 15 and 16). The total time is the time from the user action initiating the request until the response page is fully displayed.

Figure 12.1: The round trip of a request. Every request includes steps 1 to 16.

We perform the test by examining the total round trip of a request. Most of the steps are performed only once. Some steps are repeated, though:

Step 2 is the polling from the application, which checks whether the response has arrived. Depending on the policy, we may poll all the time or sleep between polls.

Step 5 is the polling of the global queue, which checks for new messages and transports the messages waiting to be sent.

Step 10 is the agent's polling for messages addressed to it.
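The repeated polling in step 2 can be sketched as a loop with a configurable sleep time between polls. This is an illustration under stated assumptions rather than the code of the PollingThread: the interface StatusSource, the class Poller and the poll limit are hypothetical, and only the status values READY, NOTREADY and ERROR are taken from the description of getrequeststatus.

```java
/** Hypothetical source of the delivery status ("READY", "NOTREADY" or "ERROR"). */
interface StatusSource {
    String status();
}

final class Poller {
    /**
     * Polls until the status is READY. Returns the number of polls that were needed,
     * or -1 if the status was ERROR, the poll limit was reached, or the thread was interrupted.
     * A sleep time of 0 corresponds to continuous polling.
     */
    static int pollUntilReady(StatusSource source, long sleepMillis, int maxPolls) {
        for (int polls = 1; polls <= maxPolls; polls++) {
            String status = source.status();
            if ("READY".equals(status)) {
                return polls;
            }
            if ("ERROR".equals(status)) {
                return -1;
            }
            if (sleepMillis > 0) {
                try {
                    Thread.sleep(sleepMillis);   // leave the processor to the other threads
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }
}
```

With a sleep time of 0 the polling thread competes with the other threads for the processor; a longer sleep time frees the processor but can leave a response waiting in the queue before it is fetched.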

Figure 12.2: Request times for different requests. Every request results in a response. The size of the response is indicated by the name of a form.

General time distribution

In figure 12.2, we see the round trip time (service time) for requests. The service time stretches from 22 to 47 sec, depending on the response time and size. In general, about 70% of the service time is spent on the hand held side, and about 30% on transporting the message to and from the global queue, where it is serviced.

Request time is between 0.5 and 1.5 sec (0.9 sec on average). This time includes placing the request in the underlying layers and fetching it back. It includes the poll request time and the sleep time between polls. The first request (login) takes longer, because its request time includes the initiation of the application support layer's objects (the objects between steps 1 and 5 in figure 12.1). The first request takes approximately 10 seconds longer on average, around 11 seconds in total. The later polling execution takes around 0.9 sec.

Time in MOM is the time a message waits in the queue. This is the time between steps 4 and 5, and later between steps 12 and 13. On average, the message waits for 2 sec each way. Then it is fetched either by the Communication Manager or by the Request Manager.

Time of transport in MOM is the time from when a message is fetched from the Queue in step 5 until it is delivered back in step 12. On average, this takes 12 sec.

Parse time is the time of processing the response. Processing includes parsing and generating screen widgets. It takes between 3.5 and 15 sec to parse the response (depending on the response size).

Display time is the time it takes to display the widgets that were generated by the parser. Display time is small (under 1 sec).

Now we look at the single time groups and examine where the delays occur.

Request Time

Because the request time includes polling and synchronizing, we have examined whether different polling and synchronizing techniques influence the total request time.

Figure 12.3: Request times with variable polling time

In figure 12.3, we see the request time for different requests, depending on the application polling time. We can conclude that a sleep time of 0 for the polling thread reduces the system performance slightly. Polling every 2 sec is optimal. The difference between continuous polling and scheduled polling can reach around 8 sec. Polling too seldom increases the time the response message waits in the queue before it is fetched by the application.

Sending a request to the queue takes 5.8 sec on average, and fetching it takes 7.8 sec on average.

Figure 12.4: Request times with variable synchronization time.

In figure 12.4, we see that the request time does not depend on the synchronizer sleep time. The average waiting time for a message in the queue is between 1.95 sec (for Sync sleep time = 0) and 3.39 sec (for Sync sleep time = 2 sec).

Request time and response message size

In figure 12.5, we can see the relation between the response size and the time of fetching the message from the queue. It takes approximately 1 sec to transport 100 bytes from the queue to the application.

Conclusion

In general, we can conclude that the request time does not depend on the thread load of the system. Too intense polling reduces the performance slightly. The most time-consuming operations are object creation and string operations.

  size of the response (bytes):
  time of fetching the message (sec): 6.17, 4.63, 11.38, 7.92, 10.98, 3.28

Figure 12.5: The relation between response message size and time of fetching the message
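The rule of thumb of approximately 1 sec per 100 bytes can be turned into a small estimate. This is a back-of-the-envelope sketch, not a model fitted to the measurements: the class and method names are hypothetical, and the example sizes of 319 and 1119 bytes are borrowed from the parsing measurements later in this chapter.

```java
/** Rough fetch-time estimate based on the rule of thumb of 100 bytes per second. */
final class FetchTimeEstimate {
    static final double BYTES_PER_SECOND = 100.0;   // approximate rate stated in the text

    /** Estimated time in seconds to move a response of the given size from the queue to the application. */
    static double secondsFor(int responseBytes) {
        return responseBytes / BYTES_PER_SECOND;
    }

    /** The same rate expressed in kilobits per second (1 byte = 8 bits, 1 kbit = 1000 bits). */
    static double kilobitsPerSecond() {
        return BYTES_PER_SECOND * 8.0 / 1000.0;
    }
}
```

Under this assumption, a response of 319 bytes is estimated at about 3.2 sec and one of 1119 bytes at about 11.2 sec, which is of the same order as the shortest and longest fetch times listed in figure 12.5. The rate corresponds to 0.8 kbps.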

Time in local MOM

Messages stay in MOM for around 2 sec. This time includes the time of converting a Message object into bytes, which are then copied to a Palm DB. Even when synchronizing and polling constantly, the message stays in the local queue for 2 sec.

Figure 12.6: The service time of the request on the fixed network side. The service time on the server side is approximately constant. The time of receiving the message and placing it in the local queue depends on the size of the response message.

Time of transport in MOM

The time of message transport in MOM includes two steps:

1. Sending the message
2. Receiving the message

After a message is sent, the HttpAgent fetches it and unpacks the request. Then it executes the request against the server. As shown in figure 12.6, executing the code on the HttpAgent takes approximately the same time for every request (4.68 sec on average). The time of transport in MOM depends on the response message size.

Conclusion

The time of transporting the message in MOM (from fetching it from the queue until it is placed in the local queue again) depends mostly on the size of the response message. We can conclude that it is not the bandwidth that is limiting. The speed of accepting the bytes is on average 100 bytes/sec = 0.8 kbps (this is much lower than the limitation of any network connection).

Figure 12.7: The relation between message size and the parsing time. The parsing time is related to the size of the response message.

Time of parsing and displaying

In figure 12.7, we can see that the parse time is related to the response size. It stretches from 3.25 sec (for 319 bytes) up to the maximum measured for the largest response (1119 bytes). Additionally, the parse time did not change with different scheduling policies (sleep time for threads). Again, we can conclude that object creation while parsing is the critical time constraint.

Servicing from cache

As shown in figure 12.8, the average service time for requests differs between the connected and the disconnected state. While connected, the average service time is 19 sec. During disconnection (when the communication thread is not running), the average time shortens by 7.5 sec and falls to 11.5 sec.

Figure 12.8: A request's service time from the cache. Servicing from cache during disconnection is faster than when we are connected. The service time from cache is 17 sec shorter than servicing from the network (36 sec on average).

Conclusion

Servicing from cache is clearly faster than servicing from the network. When the hand held computer is disconnected, the service time is lower. Being connected adds a delay of 7.5 sec.

12.3 Disconnections

In the disconnection test we examine the system behavior while disconnected. We test the disconnections in three states:

1. When transporting the request to the global queue
2. When transporting the reply from the global queue
3. During the transportation

Transporting message to MOM

To test disconnections when transporting messages, we disable the central queue so that the hand held device cannot connect. We perform several requests, which time out and give the info page saying that the requests could not be serviced. When the queue is started, all the waiting messages are transported and serviced. We conclude that transporting requests to the central queue is handled correctly during disconnections.

Transporting message from MOM

We disable the HttpAgent so that it does not create any response messages. Then we perform a few requests. We see that the request messages are transported to the global queue. Then we disconnect the Palm and start the HttpAgent. The HttpAgent services the requests and returns the response messages. After re-connecting, we see that all the messages are transported correctly to the hand held computer. We conclude that disconnections when transporting messages from the global queue to the local queue are handled correctly.

Disconnecting during transportation

We have tested the message delivery by disabling the acknowledgment on the receiver side. The system behavior was correct, and the message was not deleted until the acknowledgment was received. Too long a delay resulted in a socket time-out exception, which moved the system into the disconnected state. We can conclude that the reliable protocol works correctly and no messages are lost. Exceptions on the socket are caught by the proper layers, and they are used to close the (broken) connection.

12.4 Reliability test

The reliability test is only performed on the hand held computer side. We run the following tests:

1. System crash before messages are synchronized
2. System crash after messages are synchronized
3. System crash during synchronization

Crash before synchronizing

We perform a list of requests while disconnected. The requests are waiting in the local queue. Then we provoke a crash by resetting the hand held computer. After restarting the program and re-connecting, the messages are transported to the global queue. In conclusion, the system can withstand crashes without losing messages.

Crash after synchronizing

We send the request messages to the global queue. After receiving the messages, we provoke a system crash. When the system restarts, we are able to access the unserviced requests and fetch the received responses. The conclusion is that the system can withstand crashes of the hand held computer after receiving the responses.

Crash during the synchronization

The last crash is provoked during the delivery of messages. When a message is being transported, we restart the hand held computer. After restarting the application, we see that the messages are either still in the local queue or have been transported. Sometimes the messages are duplicated, because we guarantee at-least-once delivery of the messages. In conclusion, the system handles crashes during synchronization and does not lose data.
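The duplicates that follow from at-least-once delivery can be illustrated with a sender that resends until it sees an acknowledgment and a receiver that filters repeated message IDs. This is a sketch of a common technique under stated assumptions, not a description of the implemented protocol: the classes DedupReceiver and RetryingSender, the message IDs and the simulated loss of acknowledgments are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical receiver that acknowledges every copy but delivers each message ID only once. */
final class DedupReceiver {
    private final Set<String> seenIds = new HashSet<String>();
    private final List<String> delivered = new ArrayList<String>();

    /** Returns true as acknowledgment; a duplicate is acknowledged again but not delivered again. */
    boolean accept(String messageId, String payload) {
        if (seenIds.add(messageId)) {
            delivered.add(payload);
        }
        return true;
    }

    List<String> delivered() {
        return delivered;
    }
}

/** Hypothetical sender that keeps its copy of the message until an acknowledgment arrives. */
final class RetryingSender {
    /**
     * Resends until an acknowledgment gets through and returns the number of attempts.
     * ackLost[i] == true simulates that the acknowledgment of attempt i was lost,
     * for example because the device crashed or was disconnected.
     */
    static int sendUntilAcked(DedupReceiver receiver, String messageId, String payload, boolean[] ackLost) {
        int attempt = 0;
        while (true) {
            boolean acked = receiver.accept(messageId, payload);
            boolean lost = attempt < ackLost.length && ackLost[attempt];
            attempt++;
            if (acked && !lost) {
                return attempt;   // only now may the sender delete its copy
            }
        }
    }
}
```

In this sketch the message is never lost, because the sender deletes its copy only after an acknowledgment arrives; the price is that the receiver may see the same message more than once and has to recognize the repetition.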

Chapter 13

Related Work

In this chapter, we describe how our system differs from related existing commercial products and from experimental systems in the literature. Our project addresses problems from different groups:

1. Object oriented systems
2. Data servers
3. Thin client models

13.1 Object oriented systems

Rover

Rover is an experimental framework for application programming for mobile devices [Joseph 1997]. The system offers asynchronous RPC ("queueable RPC"). The system also offers the possibility to download objects, modify them and write them back to a server. The relocatable objects are used to reduce client-server communication and to support disconnected operation. During disconnections, changes made to the local object copies are logged, and they are synchronized when the computer reconnects. A proxy on the mobile client and an agent on the server side are used as part of the design. The communication between proxy and agent is separated into a special communication layer. In order to demonstrate the system, the experimenters have implemented a number of applications. One of these is a click-away off-line browser. Web pages that are ready to be viewed can be found on a special page.

Rover is an extensive system that has also been extended to support reliability [Joseph 1996]. Our system has many similarities with Rover, although the latter is considerably more extensive and complete. The main difference is that our system has been designed specifically for small hand held computers, whereas Rover is a complex system that supports mobility in general. Rover would be a heavy load on a hand held computer. In addition, our client-server interface is not an RPC but an internet type client-server model (for example HTTP requests). Finally, our communication layer is a general MOM system that is directly accessible to the applications.

CORBA

CORBA release 2.4 (late 2000) includes a specification for asynchronous invocations, the Asynchronous Method Invocation (AMI) model [Siegel 1999]. To our knowledge, there are no implementations that support this feature yet. The specification is characterized by a flexible asynchronous programming model (call-back or polling), a communication model, and a messaging architecture that cleanly separates the asynchronous programming model from the over-the-wire communication model.

By using an agent called a router on the server side, the server (or server ORB) can be invoked synchronously. A router on the client side stores and forwards asynchronous CORBA invocations to the server-side router. The specification supports reliability by allowing routers to use persistent storage for the store-and-forward mechanism.

CORBA is considerably more complex than our client-server system but is structured in the same basic layers. However, CORBA does not offer an option for caching of previous requests.

13.2 Data-servers

Coda

Coda is a distributed file system designed for disconnected work [Kistler 1992], [Satyanarayanan 1993], [Mummert 1995]. In Coda, the client caches data with a granularity of whole files. When the client is disconnected, a local proxy acts as file manager on the cached files. Changes made to local copies are logged. When the client is connected to the network again, the changes are synchronized with the central file server. Simple, system specific file update conflicts caused by concurrent clients can be handled. Cache misses on the client are treated as errors. There is considerable support for user specified hoarding of data.

Coda focuses on providing a data server (file server). We focus on a general client-server system.

Bayou

Bayou is a general database system designed for mobile environments [Douglas 1995], [Edwards 1997], [Petersen 1997], [Terry 1998]. In order to ensure high availability, Bayou replicates the database on mobile clients. An optimistic strategy is used for data access: clients are allowed to change data in any replicated database at any time. The replicated database servers synchronize pairwise, and the change logs propagate epidemically to all replicas of the database.

The big contribution of Bayou is that the project stresses that conflicts should be dealt with in an application specific way. Bayou puts the responsibility for solving conflicts on the applications by letting them define the algorithms that Bayou uses to solve the conflicts. In practice, it has proven difficult for users to define conflict resolution algorithms unless the application semantics are very simple (for example a calendar database). Bayou is usable for work stations and when the database is small. If the database is large and the clients can not be trusted, the Bayou design can not be used.

In our system we can handle synchronization of changes and update conflicts in much the same way as Bayou. In practice, this means that we put the responsibility and the work on the server programmer to define how this should be done.
We have chosen not to go into detail on synchronization problems, because projects like Bayou have clearly shown that general support is not likely to be successful.

Oracle Mobile Agents and Oracle Lite

Oracle offers commercial products based on agent designs and proxy designs. Oracle Mobile Agents is an example of an agent design that allows asynchronous database updates [Oracle 2001a]. Oracle Lite is an example of a local proxy design [Oracle 2001b]. A light-weight replica of the database executes on the client and synchronizes with a remote central database.

13.3 Thin client systems

Thin client systems are systems where most of the work is executed on the server. Web based systems are also included in this group.

WebExpress

WebExpress is a support system for web browsers that uses HTTP-specific techniques to optimize communication over wireless network connections [Housel 1996]. The system has been extended to support disconnected operation by offering asynchronous browsing and access to a cache [Chang 1998]. A proxy server intercepts HTTP requests (TCP/IP connections) from browsers, and the HTTP request packages are sent over the network with a protocol optimized for wireless network connections. A server side agent receives the requests and executes them against the web servers.

WebExpress does not focus on reliability. In addition, the system is not a general purpose client-server system; it focuses specifically on the needs of browsers. The system does not offer the different layers of client-server programming interfaces that our system does.

Gate-way solutions

There are a number of agent based (gateway/portal) solutions for giving hand held computers access to web servers. KBrowser is a cross-platform Internet WML microbrowser that provides access to WAP-based (Wireless Application Protocol) services. KBrowser is implemented in Java and runs on the KVM virtual machine. KBrowser can use agents to access other types of servers than WAP servers. For example, the distillation agent BabelServer can be used to access web servers on the internet.

W4

W4 is an experimental WWW browser system dedicated to wireless networks [Bartlett 1994], [Bartlett 1995]. In this system, the web browser executes on a server. The hand held computer only receives screen updates and sends key-strokes. The server has a virtual screen of the client. The server can handle different kinds of clients by using different virtual screens.
ProxyWeb is an example of a commercial product that uses the same solution.

Citrix/VNC/PCanywhere

There are a number of commercial and freely available pure thin client systems that can be used with hand held computers. The basic problem with these systems is that they are sensitive to extended disconnections, because they only send screen updates to the client.

Chapter 14

Conclusion

We have succeeded in building a general client-server system that executes on hand held computers and supports disconnected work. Using the general client-server system, we have built a thin client-server system, which consists of an off-line browser and a form generating server. Using the thin client-server system, we have implemented an application that supports a work-flow in a hospital. This application demonstrates that our system is a realistic solution for hand held computers. Our hospital work-flow application cooperates with the RIME (Radiometer Instrument Management Engine) system marketed by Radiometer Medical A/S.

We present the conclusion in four sections:

1. Conclusion on main problems.
2. Conclusion on project goals.
3. Presentation of the system.
4. Future of the system.

14.1 Main problems

The system is the combination of the solutions to the four main subproblems from the problem formulation:

1. Limitations of hand held computers. To overcome the general limitations of hand held computers, we have chosen a thin client model in the form of a form browser. The application processing is executed on the server.

Additionally, we have chosen to limit the code executed on the hand held device to a minimum. By having a client-proxy-agent-server design, we can push many responsibilities to the agent and the server on the fixed network side. Client and proxy are kept very simple.

2. Supporting disconnection. Our system supports different kinds of disconnections:

Short disconnections can be handled by replacing the communication protocols.

Longer disconnections can be handled with asynchronous communication primitives. The asynchronous primitives can be implemented using a MOM system. The asynchronous communication model can be adapted to the synchronous model by using an agent that translates asynchronous messages into synchronous requests.

Long disconnections and fully disconnected operation can be handled by a proxy that uses a cache. Because we want the proxy to stay simple, it only services objects from the cache. If the proxy can not service a request, it simply forwards it to the server.

Extended periods of disconnection can be supported by downloading server objects. For completeness, we have analyzed how write-backs of locally changed objects can be implemented. However, we believe that support for object write-backs is too heavy a load for hand held computers to be realistic. In addition, we did not need this feature in order to make the hospital work-flow application. Consequently, we have not implemented any support for write-backs.

The system can handle disconnections transparently for the user. The processes responsible for transporting messages take care of opening and closing the connection without disturbing the user.

3. Ensuring system reliability. To ensure the reliability of the system, we need to protect the data stored on the computer as well as the data being transported. To tolerate failures on the hand held device, we store the data in persistent memory.
A persistent copy of the data can be used to survive soft crashes. After a crash, a module can reestablish the state it had before the crash. The problem of data transportation can be solved by using a reliable protocol combined with a store-and-forward technique. In case of a network error, we can synchronize the data at a later time. The data are moved from one persistent state to another, and they are not lost in between. To wake up crashed modules, we introduce a pair of watchdogs, where one watches the system and the other watches the first watchdog. The watchdogs can keep the system operational.

4. Building a work-flow application. Using the thin client system, it was relatively easy to build an application supporting a work-flow in a hospital. The only problem was to collect and synchronize the data from concurrently running processes. We have concluded that the problem of merging activities in a work-flow is not specific to disconnection. Any system that allows concurrent work and does not communicate data between the activities faces the problem. Disconnected systems are, however, more prone to concurrent execution without communication between tasks. Because of the lack of communication, we may be forced to transport the data externally between the activities, for example by agreeing orally on a merge key. Another solution is to find semantic dependencies between the data and create merging algorithms that use these dependencies when the data arrive at a merging process.

14.2 Project goals

We have shown that it is possible to build a system that is reliable, executes on a limited device and supports disconnections. Thereby, all the goals of our project were fulfilled:

1. We have made an extensive literature search and found solutions for the different subproblems.
2. We have found solutions for the four problems and used them to assemble the whole system.
3. We have designed a modular system which is open and can easily be expanded.
4. We have implemented the system and a prototype that uses it.
5. We have tested the system. The main test is the system itself: it is usable and it cooperates with RIME.

14.3 The system

The system that we have designed and implemented has three layers and a watchdog:

1. Communication layer. This layer implements asynchronous primitives and ensures reliable package transportation.
2. Application support layer. This layer offers an off-line request-reply protocol for the applications. Additionally, it supports the off-line application with proxy and agent capabilities. The proxy has a cache that services requests during disconnection. The agent is the on-line part that performs requests on behalf of the mobile computer.
3. Application layer. This layer is an example of an off-line programming framework and serves as a guide on how to use our system.
4. Watchdog. This is an additional process that controls the system reliability by watching the system layers.

We have implemented a prototype that demonstrates how to use the general client-server support system. The prototype supports a work-flow in a hospital. Designing and building the system led to some further conclusions, which we present in the following sections.

Proxy

Using a proxy has proven useful on small mobile devices. By expanding the simple and general proxy with a cache, we could omit heavy, fully functional server replicas. We have designed a client-server proxy that supports reads and writes on data objects. The proxy is simple enough to be executed on a hand held device. It is also general, because it can service any client-server application that uses internet type requests. We could avoid the use of fully functional server replicas because:

1. The general proxy supports simple write-backs. Because the application takes care of handling the writes, the proxy is functional enough.

2. We use a unique hash-code based on the URL to identify cached objects. In contrast to an application specific cache, our solution makes it possible for applications to re-use the data of other applications, which may be preferable during disconnections. Therefore, we only need one general cache that can handle all applications using internet type requests.

Agent system

The use of agents turned out to be a very flexible design. The main responsibility of the agent is to transform asynchronous requests into synchronous ones. Because agents execute on the fixed network, they are not as limited as the hand held computers. Therefore, they can be expanded with additional functionality. Agents can range from simple repeaters (like ours) to more sophisticated agents that may perform series of requests on behalf of the application. They can also be used to keep sessions alive that would otherwise die in case of disconnection.

We have only implemented a simple agent that executes single requests. It was expanded with the functionality of repeating a request to a server that is not ready with an answer, for example because of a lost connection.

Disconnected thin client application

An off-line system can be designed using a relatively simple framework. With a disconnected browser we only need to consider two aspects:

1. Form design. Using forms makes it easy to make applications with low power consumption that can run on thin clients. We have designed the forms so that they suit the small screens on Palm devices. Additionally, the forms can be adjusted to different devices by making the requests device specific. Forms are generated on the server side, and in this way the application can easily be updated. The server also takes care of adapting the responses to the specific devices.
2. Server logic. Running the main application on the server gives flexibility in expanding the system. No clients need to be updated.
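The URL-keyed cache described under "Proxy" above, together with the rule that the proxy forwards any request it can not service, can be sketched roughly as follows. This is a minimal illustration and not the thesis implementation: the class name CachingProxy, the use of SHA-1 as the hash function, the in-memory map and the server stand-in passed as a function are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Small demo; the lambda stands in for a real server on the fixed network. */
class CachingProxyDemo {
    public static void main(String[] args) {
        CachingProxy proxy = new CachingProxy(url -> "response for " + url);
        System.out.println(proxy.get("http://host/forms/form1")); // forwarded to the server
        System.out.println(proxy.get("http://host/forms/form1")); // serviced from the cache
    }
}

/** Hypothetical proxy with one general cache shared by all applications. */
class CachingProxy {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> server; // assumption: stands in for the network path

    CachingProxy(Function<String, String> server) {
        this.server = server;
    }

    /** Derives the cache key from the URL (SHA-1 is an assumption, any stable hash would do). */
    static String key(String url) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Services the request from the cache if possible, otherwise forwards it. */
    String get(String url) {
        String k = key(url);
        String cached = cache.get(k);
        if (cached != null) {
            return cached;
        }
        String fresh = server.apply(url); // cache miss: forward to the server
        cache.put(k, fresh);
        return fresh;
    }
}
```

Because the key depends only on the URL, two applications that request the same URL share one cache entry, which is the re-use property mentioned above.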

Portability

We have implemented the whole prototype in Java. Additionally, we have used open standards and APIs on the different layers. These standards are:

Java. Java is well known to be a multi-platform programming language. Apart from the specific GUI for Palm OS, the code is portable to other devices.

JMS. JMS has given us a flexible design for message processing. It is open for implementing all the necessary primitives, such as call-back and blocking/non-blocking sending.

Java internet type client-server model. The URL and HttpConnection classes from the internet type client-server model in Java have proven very practical because the API splits protocol, content and connection. All of them can be combined freely.

Performance

We have observed that the processing power of the CPU is the critical factor in the performance of our system. Servicing a request takes on average 36 seconds (19 seconds when servicing from the cache), and over 70% of the time is spent on the hand held device. Our tests have shown that most of the time is consumed by object creation and string operations. We do not see any design problems that may influence the system performance. The low performance can partly be explained by the virtual machine executing on a slow computer. Another explanation is an implementation that is not optimal.

The bandwidth had no impact on the overall performance. This is mainly because the messages we send are quite small and can be transported fast. The major work lies in processing the messages by creating objects and storing them in persistent memory.

14.4 Future of the system

The growth of systems for hand held and mobile computers is remarkable. During the six months we have worked on this project, a number of new software tools for hand held computers have become available. The development in the field of new hardware platforms is especially interesting:

Hand held computers are becoming increasingly stronger with respect to memory and processor power.
The hand held computer we have been working with (SPT 1700 from Symbol Technologies) is at the absolute lower end of the devices with respect to memory and processor power.

Hand held computers are also being equipped with better network connections. The hand held computer we worked with comes with an 11 Mb/s wireless LAN connection.

There is a clear tendency towards convergence between 3G mobile phones and hand held computers. Urban network coverage is not far away.

We believe that the system we have built is very useful even with high coverage of high bandwidth wireless network connections. Support for disconnections is useful for securing a high level of availability. Support for disconnections is thus a reliability measure. In addition, we believe that our focus on general reliability will be useful in many practical systems.

Appendix A

Appendix

A.1 XML

This is an example of an XML form that can be displayed by the browser application.

    <a href="gatewayserver.servletformsurl?name=robert">click here for Name=Robert</a>
    <form name="form1" action="gatewayserver.servletformsurl">
      <input type="text" name="fname" value="robert">first name</input>
      <input type="text" name="lname" value="bialek">last name</input>
      What kind of vehicle do you have:
      <input type="checkbox" name="vehicle" value="car">
        <option value="bike">i have a bike</option>
        <option value="car">i have a car</option>
        <option value="bus">i have a bus</option>
      </input>
      <input type="radio" name="vehicle2" value="car">
        <option value="bike">i have a bike</option>
        <option value="car">i have a car</option>
        <option value="bus">i have a bus</option>
      </input>
      <input type="submit" name="end" value="ok">ok to submit</input>
      <input type="cancel" name="end" value="cancel">cancel</input>
      End of form
    </form>

A.2 Rime interface

Rime functions as a data manager in the hospital IT system from Radiometer. Each apparatus can contact Rime and deposit measurement data (blood measurement data and calibration data). Rime also contains demographic data such as basic patient information and personnel information. Requisitions (also called test orders) can also be registered in Rime.

Rime is implemented as a CORBA system and offers CORBA interfaces. It is possible to read and write many data structures through this CORBA interface. Clients can register in Rime to receive different events. For example, when an apparatus delivers measurement results, it is possible to get a notification. The client can subsequently fetch the measurement result if it is interested in that particular result.

Because Rime can be accessed through a CORBA interface, it is possible to implement clients in many different programming languages. We use Java clients to communicate with Rime.

A.3 URL addressing

Internet type request-reply models address objects on the internet using URLs. W3 has chosen the URL (Uniform Resource Locator) as the addressing method on the internet. A URL is, as its name indicates, a uniform way to locate resources on the Web. Apart from the placement of the object, it gives the object type and the protocol that is used to communicate with the object. In general, a URL has the form:

protocol://hostaddress:port/objectlocation/objectname.objecttype

Here protocol is the protocol used to communicate with the object, hostaddress is the internet address of the host holding the object, port is the TCP port number, objectlocation is the location of the object on the host (often a directory path relative to the root of the Web server), objectname is the name of the object and objecttype is the type of the object. Some examples of URLs:

ftp:// :8080/files/thefile.bin

In the internet programming model of Java, the URL is encapsulated in an object (URL) and is used to access the other objects in the model.

A.4 Design of the internet type model in Java

As seen from the URL format, the protocol used to communicate with an object can be separated from the encoding of the object itself (objecttype). For example, if we want to fetch the object, the object is just bytes to the protocol that transports it. The protocol module can in turn be divided: the connection used to communicate is in part related to the protocol, but the same protocol could be used over many different types of connections. These insights have led to a design where the whole system is divided into three main parts:

URL object. Encapsulates Web objects by their URL addresses and is used to access the other objects in the model.

Protocol handler object. It is divided into:
1. URLStreamHandler, which parses the URL and chooses the connection.
2. URLConnection, which communicates and chooses the content handler.

Content handler. Reads the data stream and constructs the object (MIME type handler).

It is sometimes useful to read the data streams and construct the objects in the program that uses the system. To facilitate this, it is possible to work with the system without the content handler.

The HTTP protocol is often used. In this case the HttpURLStreamHandler and HttpURLConnection objects are used. The HttpURLConnection opens a socket using TCP/IP to service the requests.

A.5 Components in asynchronous systems

In general, asynchronous systems are implemented by offering:

1. Memory to store requests until they can be serviced.
2. An interface with asynchronous primitives.
3. System support that issues requests on behalf of the application.

Memory to store requests until they can be serviced

Because of possible disconnections, we have to store client requests in memory, after which control can be returned to the application. The request parameters are kept in memory until the process responsible for issuing them has made the request. This memory can be some low level shared buffer memory, or it can be abstracted into a form of mailbox or queue system, which is often the case. Using a queue as the memory model gives many possibilities for manipulating the stored data. For instance, different kinds of ordering guarantees (for example FIFO) or priorities can be offered.

There are some basic problems connected to the implementation of these systems. First of all, the memory has to be managed. Secondly, there might be synchronization problems if the client application and the issuing process access the same memory.

An interface with asynchronous primitives

There are two main types of asynchronous primitives:

1. Polling based. When polling is used, two primitives are usually offered: a request primitive and a test primitive. The request primitive returns control immediately after it has delivered the request to the system support. The test primitive can subsequently be used to poll for the response. Sometimes the request primitive takes a time parameter. In this case the request blocks for the specified time interval and returns with the response if it arrives within the interval [Tanenbaum 1995].
2. Call-back based. When call-back is used, one primitive is usually offered: a request primitive that takes a reference to a call-back object as a parameter. The system support invokes the call-back object when the response arrives.
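The three components enumerated at the start of this section can be illustrated with a small sketch. It is not the interface of our system: the class AsyncRequester, its method names and the server stand-in passed as a function are hypothetical, and the flush method plays the role of the system support that issues the stored requests once a connection is available.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;
import java.util.function.Function;

/** Small demo of both primitive styles; the lambda stands in for a server. */
class AsyncRequesterDemo {
    public static void main(String[] args) {
        AsyncRequester requester = new AsyncRequester(payload -> "echo:" + payload);
        int id = requester.request("a");                   // polling style
        System.out.println(requester.test(id));            // no response yet
        requester.request("b", r -> System.out.println("call-back got " + r));
        requester.flush();                                 // "reconnect" and issue requests
        System.out.println(requester.test(id));            // response is now available
    }
}

/** Hypothetical sketch: request memory, asynchronous primitives, system support. */
class AsyncRequester {
    /** A stored request: identifier, payload and an optional call-back. */
    private static final class Pending {
        final int id;
        final String payload;
        final Consumer<String> callback;

        Pending(int id, String payload, Consumer<String> callback) {
            this.id = id;
            this.payload = payload;
            this.callback = callback;
        }
    }

    private final Queue<Pending> queue = new ConcurrentLinkedQueue<>();       // FIFO request memory
    private final Map<Integer, String> responses = new ConcurrentHashMap<>(); // arrived responses
    private final Function<String, String> server;                            // assumption: remote server
    private int nextId = 0;

    AsyncRequester(Function<String, String> server) {
        this.server = server;
    }

    /** Polling-style request primitive: stores the request and returns immediately. */
    synchronized int request(String payload) {
        int id = nextId++;
        queue.add(new Pending(id, payload, null));
        return id;
    }

    /** Test primitive: returns the response, or null if it has not arrived yet. */
    String test(int id) {
        return responses.get(id);
    }

    /** Call-back-style request primitive: the call-back is invoked when the response arrives. */
    synchronized void request(String payload, Consumer<String> callback) {
        queue.add(new Pending(nextId++, payload, callback));
    }

    /** System support: issues all stored requests in FIFO order. */
    void flush() {
        Pending p;
        while ((p = queue.poll()) != null) {
            String response = server.apply(p.payload);
            if (p.callback != null) {
                p.callback.accept(response);
            } else {
                responses.put(p.id, response);
            }
        }
    }
}
```

The sketch keeps the queue in volatile memory and omits the timed variant of the request primitive; a reliable variant would keep the queue in persistent storage.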

System support

The system support ties everything together. It should be capable of issuing requests when this is possible, and of either making the response available to the client application when it arrives or running the call-back procedure if a call-back mechanism is used.

A.6 MOMS

Message-oriented middleware (MOM) systems can be used to build reliable and flexible distributed applications. In the following we discuss MOM systems in general and describe the Java Message Service (JMS) developed by Sun Microsystems. JMS is a specification for a consistent API that gives developers access to the common features of many messaging system products.

A.6.1 MOM systems in general

MOM systems can be used for communication between loosely coupled components through messages. MOM systems allow separate, uncoupled applications to communicate reliably and asynchronously. The message system architecture often replaces the client-server model with a peer-to-peer relationship between individual components, where each peer can send messages to and receive messages from other peers. However, as demonstrated in this project, client-server systems can be built on top of MOM systems.

Message-oriented middleware (MOM) [Steinke 95] resides in both portions of a client/server architecture.

Figure A.1: Message Oriented Middleware

As shown in figure A.1, MOM is typically placed on top of the network and transport layers. Messages are stored in queues. The message queues provide temporary storage when the destination process is busy or the connection is lost. As soon as a message has been transported from one queue to another, it is removed from the temporary storage.³ MOM is typically asynchronous and peer-to-peer, but most implementations support synchronous message passing as well.⁴

MOM systems typically support different properties such as time to live, error handling and encryption. Additionally, MOM can easily be made reliable by saving messages in persistent storage and using reliable protocols.⁵ MOM makes it easy to exchange the underlying communication layers, such as the transport layer and the network layer, without making this visible to the applications.

MOM can be very practical in systems with a fixed network where only stationary computers are used. Because asynchronous systems do not demand that the client and the server are connected simultaneously, they are very good for making reliable distributed systems in general.

A.6.2 Types of MOM systems

There are two common types of MOM systems:

Point-to-point. In point-to-point messaging systems, messages are routed from a producer application to a consumer application. The producer application sends messages to a specified queue, and consumer applications retrieve messages from the queue.

Publish/subscribe. A publish/subscribe MOM system supports an event driven model where message consumers and producers participate in the transmission of messages. Producers publish events, while consumers subscribe to events of interest and consume the events. Many consumers can receive events from the same producer. In practice, the producers associate messages with a specific topic, and the messaging system routes messages to consumers based on the topics the consumers have registered interest in.
³ There exist many MOM systems that are not compatible with each other (MSMQ (Microsoft), JMS, MQSeries).
⁴ The asynchronous mechanism of MOM, unlike Remote Procedure Call (RPC), which uses a synchronous, blocking mechanism, does not guard against overloading a network. As such, a negative aspect of MOM is that a client process can continue to transfer data to a server that is not keeping pace. The use of message queues in message-oriented middleware, however, tends to be more flexible than RPC-based systems, because most implementations of MOM can default to synchronous and fall back to asynchronous communication if a server becomes unavailable [Steinke 95].
⁵ The different reliability methods are discussed in chapter 7.
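The difference between the two types of MOM systems in section A.6.2 can be illustrated with a small in-memory sketch. It does not use JMS or any real messaging product; the classes PointToPointQueue and TopicBroker and their methods are hypothetical names chosen for the example, and delivery happens synchronously inside one process.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Consumer;

/** Small demo contrasting the two delivery models. */
class MomTypesDemo {
    public static void main(String[] args) {
        PointToPointQueue queue = new PointToPointQueue();
        queue.send("request 1");
        System.out.println(queue.receive()); // consumed by exactly one receiver

        TopicBroker broker = new TopicBroker();
        broker.subscribe("results", m -> System.out.println("consumer A: " + m));
        broker.subscribe("results", m -> System.out.println("consumer B: " + m));
        broker.publish("results", "new measurement"); // delivered to both consumers
    }
}

/** Point-to-point: a message is stored in a queue and retrieved by one consumer. */
class PointToPointQueue {
    private final Queue<String> messages = new ArrayDeque<>();

    void send(String message) {
        messages.add(message);
    }

    /** Removes and returns the oldest message, or null if the queue is empty. */
    String receive() {
        return messages.poll();
    }
}

/** Publish/subscribe: a message is routed to every consumer subscribed to its topic. */
class TopicBroker {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    void subscribe(String topic, Consumer<String> consumer) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(consumer);
    }

    void publish(String topic, String message) {
        List<Consumer<String>> consumers =
                subscribers.getOrDefault(topic, Collections.emptyList());
        for (Consumer<String> consumer : consumers) {
            consumer.accept(message);
        }
    }
}
```

A real MOM system would additionally store messages persistently and deliver them asynchronously; the sketch only shows how messages are routed in the two models.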

A.6.3 The Java Message Service (JMS)

The Java Message Service is part of the Java 2 Enterprise Edition (J2EE) suite. JMS provides a standard API that Java developers can use to access the common features of enterprise message systems. JMS supports both the publish/subscribe and the point-to-point model. The basic idea of JMS was to provide a consistent interface that messaging system clients can use independently of the underlying message system provider. MOM system vendors have the responsibility to provide implementations of the JMS API for their products.

A.6.4 JMS in detail

JMS defines some basic abstractions: message, connection, session, and message consumer and message producer.

Message

The message is the center of the system and contains the information necessary for routing and interpretation. JMS messages are designed to cover the message structures of many different vendors. A JMS message consists of three parts:

Message header. Used for identification.
Properties. Used for application-specific, provider-specific and optional header fields.
Body. Holds the content of the message.

There are different message formats. The two most important formats are TextMessage, which wraps a simple String, and ObjectMessage, which wraps an arbitrary Java object (which must be serializable). The message object offers a number of methods for manipulating the message.

Connection

When using a message system that offers a JMS interface, we start by creating a connection to the messaging system. The Connection object is created using a ConnectionFactory, which is typically located using a naming service such as JNDI. The implementation of the connection is vendor specific and hidden from the user (TCP/IP is one solution).

Session

The session object is used to define the context in which messages should be understood. The properties of the session are used to control transactions and message acknowledgment. The session object is used in the construction of the rest of the objects. This includes the messages, the message consumers and the message producers.

Message consumer and message producer

Message consumer and message producer objects are constructed with the Session object. These objects are used for sending and receiving messages.

202 Appendix B Reliability methods B.1 Computer system faults in general A fault in a computer system is an anomalous condition in the systems hardware or software [Torres-Pomales 2000]. The result of faults can be that the computer system does not meet its specifications. The fault might result in a whole system crash or that the application will crash or hang. In this context, an application crash means that the process (or thread) the application is executing in is no longer in the system. An application hangs if the application process is still in the system but it is making no progress. A system crash means that hardware is not operating. Faults can be classified based on their characteristics and their effects on the computer system: Transient/Permanent Transient errors are errors that come and go. Transient hardware faults can occur because of for example loose electrical connections. Transient software faults are often caused by subtle software errors (race conditions), resource exhaustion or transient errors in hardware. These faults can often be masked by re-executing the operations that lead to the fault. Permanent faults are present in the system until they are removed. Re-executing the operations that lead to the error will also lead to the error. Permanent faults sometimes corrupt the system leaving the system unrecoverable. Recoverable/unrecoverable Recoverable errors (soft errors) do not permanently damage the system. For a hand held computer soft errors can occur if the operating system crashes. Unrecoverable errors (hard errors) permanently damage the system, leaving it in a corrupt state. Hand held computers are especially prone ix

to unrecoverable errors. They are more exposed to theft or to being dropped than stationary computers. [1] Most errors are transient and recoverable [Gray 1991].

B.2 General fault tolerance techniques

Building correct software for complex systems has proven to be very difficult. Even with extensive testing, errors can remain in the software, because in general one can only test for the types of error that one expects. In addition, some errors are state-dependent and are activated only by particular input sequences. In a test situation, that particular state followed by that particular input sequence might never occur.

Knowing that faults cannot be avoided pushes us towards other techniques: preventing catastrophes, containing errors (preventing them from spreading), detecting faults (error checking techniques) and recovering from faults. [2]

Fault tolerance design is concerned with the ability of a system to continue delivering services in the presence of faults in the system [Torres-Pomales 2000]. This has two dimensions: availability and data consistency [Huang 1995]. For some systems availability is most important; for other systems data consistency is. To what extent fault tolerance techniques should be used in a system depends on the requirements of the application. A hospital system places high priority on both data consistency and availability.

Fault tolerance usually implies using more resources than the minimum needed to deliver a service. Typically, redundancy is used in hardware (RAID discs), software, information (error detecting and correcting codes) or time (repeating a computation in ways that allow faults to be detected).

[1] Sometimes unrecoverable faults can be recovered. If the hand held computer is broken, we might not be able to recover the data on that computer. But if we use a technique in which the state has been checkpointed to stable storage on another computer (for example a stationary computer on the fixed network), we can in fact recover the state using that computer and a new hand held computer.

[2] There are in fact four ways to deal with faults:
1. Prevention. By using design techniques for software and hardware, some errors can be prevented.
2. Removal. By testing, some faults can be found and removed.
3. Restriction. The error is ignored and avoided by restricting the use of the system.
4. Fault tolerance. Measures for handling faults are built in, making the system tolerant of them.
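Section B.1 notes that transient faults can often be masked by re-executing the operations that led to the fault, while permanent faults cannot. A minimal sketch of that idea is given below. The exception classes `TransientFault` and `PermanentFault`, the function `run_with_retry` and the helper `make_flaky` are hypothetical names introduced only for this illustration; they are not part of the prototype described in the thesis.

```python
import time


class TransientFault(Exception):
    """Hypothetical marker for faults that may disappear on re-execution."""


class PermanentFault(Exception):
    """Hypothetical marker for faults that re-execution cannot remove."""


def run_with_retry(operation, max_attempts=3, delay_seconds=0.0):
    """Re-execute `operation` until it succeeds or the attempts run out.

    Transient faults are masked by retrying. A permanent fault raised by
    the operation is not caught and propagates at once, because
    re-execution would only repeat it.
    """
    last_fault = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientFault as fault:
            last_fault = fault
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    # A fault that survives every retry is treated as permanent.
    raise PermanentFault("still failing after retries") from last_fault


def make_flaky(failures_before_success):
    """Build an operation that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientFault("simulated loose connection")
        return state["calls"]

    return operation
```

For example, `run_with_retry(make_flaky(2))` succeeds on the third call, whereas an operation that keeps failing is reported as a `PermanentFault` once the attempts are used up. Where to draw the line between the two classes is an assumption of the sketch; in practice it depends on the application.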

Fault tolerance can be achieved by replicating the same functionality and adding a simple mechanism for choosing between the replicas. [3] Fault tolerance design includes the following concepts:

Detection is the basic process that initiates all the other fault tolerance techniques. Examples of detection methods are time checks and checks of known properties of data structures. In many programming languages, exceptions are used to signal an error.

Diagnosis includes assessing the damage and assessing the possibilities for tolerating the fault.

Containment includes techniques for preventing an error from spreading. A modular or layered design of complex systems is useful and often used.

Masking includes techniques for making the error transparent to the user. Replication is itself a masking technique, but in this context masking is typically provided by redundancy in data (error checking codes) or by replicating data. The redundancy is often expressed in error handling code.

Recovery and repair includes techniques for reaching a state other than the error state and for working around the conditions that led to the error.

In the following we look more closely at three of these concepts: containment, masking and recovery.

[3] Designs for fault tolerance in hardware are often based on replication. Multiple pieces of hardware can collaborate to give correct results (for example by voting), or backup hardware can take over from defective primary hardware.

B.2.1 Containment

The concept of containment includes techniques for preventing an error from spreading. Containment of errors in complex systems can be supported by a modular or layered design. Each module defines a containment region. A hierarchically structured containment system can be built from the modules. The basic idea is that each module should prevent internal errors from spreading by handling them inside the module. Fault tolerance should be made

on a module level. Faults that cannot be handled internally should be thrown to the module levels that can handle them. If a module detects an error (exception) in its interface because of an invalid invocation from a peer module, the fault (exception) should be thrown to the peer module. Similarly, a higher-level module that encapsulates the module has to handle faults that occur at the lower level. According to the end-to-end argument [Saltzer 1994], full support for detecting and handling errors cannot be provided without knowledge of the top level (for example the application).

B.2.2 Masking

A module that handles a fault internally can mask the fault from its surroundings. Hardware faults are often masked by replicating modules and using a simple mechanism for choosing between them (for example voting). Software faults can also be masked by replicating the code. These techniques are called multiversion techniques and can be used at all levels of abstraction, from redundancy of small methods to redundancy of modules and whole programs. [4] Executing the code replicas in parallel and selecting between their results, or executing them in series until one completes without faults, raises the probability of completing an operation without faults.

A well-known technique called process pairs uses two identical versions of a whole program [Tanenbaum 1995]. Usually one of the replicas runs as the primary while the other listens and waits to take over if the primary fails. A variation of the technique uses a watchdog [5] process to monitor the primary process and to start the backup process if the primary process fails.

Replication of server code and data is often used to mask faults.
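The two ways of using code replicas described above, executing them in series until one completes without faults or executing all of them and selecting between the results, can be sketched as follows. This is only an illustration under stated assumptions. The function names `run_in_series` and `run_with_voting`, the acceptance test, and the three square-root versions are all hypothetical, and the "parallel" variant simply calls the versions one after another before voting.

```python
from collections import Counter


def run_in_series(versions, argument, acceptance_test):
    """Try each version in turn and return the first accepted result."""
    for version in versions:
        try:
            result = version(argument)
        except Exception:
            continue  # this version failed; move on to the next one
        if acceptance_test(argument, result):
            return result
    raise RuntimeError("all versions failed")


def run_with_voting(versions, argument):
    """Run every version and return the result a strict majority agrees on."""
    results = []
    for version in versions:
        try:
            results.append(version(argument))
        except Exception:
            pass  # a crashed version simply casts no vote
    if results:
        winner, votes = Counter(results).most_common(1)[0]
        if votes > len(versions) // 2:
            return winner
    raise RuntimeError("no majority among versions")


# Three hypothetical, independently written integer square roots.
def isqrt_loop(n):
    root = 0
    while (root + 1) * (root + 1) <= n:
        root += 1
    return root


def isqrt_float(n):
    return int(n ** 0.5)


def isqrt_buggy(n):
    return n // 2  # deliberately wrong for most inputs


def is_isqrt(n, root):
    """Acceptance test: root is the integer square root of n."""
    return root * root <= n < (root + 1) * (root + 1)
```

With these definitions, `run_in_series([isqrt_buggy, isqrt_loop], 17, is_isqrt)` rejects the faulty result and returns the result of the second version, and `run_with_voting` outvotes the faulty version when the two others agree. The serial variant needs an acceptance test that can recognise a wrong result, while the voting variant needs a majority of correct versions.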
[4] Multiversion techniques try to mimic the techniques known from hardware fault tolerance. Multiversion techniques [Lyu 1995] are based on the assumption that a piece of software implemented in different ways will fail in different situations. There are many different techniques: recovery blocks, N-version programming, N self-checking software, etc. [Torres-Pomales 2000].

[5] A watchdog is a process responsible for taking some action when a watched module fails. Usually the action is to restart the module.

If server data are replicated to clients, communication faults such as disconnections can be masked [Douglas 1995] [Kistler 1992]. Systems that support disconnected operation often offer resilience to communication errors. Replicating data increases the availability of the data but introduces consistency problems if the copies can be changed independently (optimistic update strategy).

Communication faults can be masked by using redundant data such as error correcting codes. Saving data, to be sent, and retransmitting until success is a

common technique for masking communication faults. In message-oriented middleware (MOM) the client and server communicate asynchronously through a broker. We examined this technology in section 6 as a way to support disconnected operation. With this type of system, short periods of disconnection or a breakdown of the client or server can be masked. In general, communication link faults can be masked by systems that support disconnected operation.

B.2.3 Recovery

If a critical fault has occurred, the system should have measures for getting out of the fault state and into a state where it is operational. If a module has not crashed, it might be able to recover itself. If a module hangs or has crashed, another module has to restart it. As a last resort, the user has to restart some modules. In general it is good design for each module to be able to recover itself when it is restarted by another module. In static recovery, a module is recovered to a predefined state (reset). In dynamic recovery, a module is brought back to a previous state. The concept of recovery implies that the fault is recoverable. [6] Recovery techniques can be based on two methods:

1. Soft state
Data are saved in soft (non-persistent) memory (usually in other processes). When a process crashes, its state is recreated from copies kept by other processes. Process pair systems can be built that use only non-persistent storage (if the secondary process is always lagging behind the primary). Some systems recover from state broadcast by the modules that are still alive in the system [Fox 1997]. These techniques rely on the assumption that at least one module is running and that the communication links are stable.

2. Hard state
Data are saved in hard (persistent) storage. When a process crashes, its state can be recreated from the persistent data. Later we present the system based on hard state.

[6] Recoverable errors were presented in B.1.
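The masking technique mentioned at the start of this page, saving the data to be sent and retransmitting until success, can be illustrated with a small outbox. The sketch below is an assumption-laden illustration and not the design of the prototype. The class `Outbox`, the `send` callable (assumed to return True on confirmed delivery and to raise `ConnectionError` while disconnected) and the test double `FakeLink` are hypothetical. The queue is held in memory only, so on its own it masks disconnections but not a crash of the sender.

```python
from collections import deque


class Outbox:
    """Keep outgoing messages until the transport confirms delivery."""

    def __init__(self, send):
        # `send` is a hypothetical transport callable: it returns True on
        # confirmed delivery and raises ConnectionError while disconnected.
        self._send = send
        self._pending = deque()

    def submit(self, message):
        """Save the message first, then try to deliver everything pending."""
        self._pending.append(message)
        self.flush()

    def flush(self):
        """Retransmit in order; stop at the first failure and keep the rest."""
        while self._pending:
            try:
                delivered = self._send(self._pending[0])
            except ConnectionError:
                delivered = False
            if not delivered:
                return False
            self._pending.popleft()  # drop a message only once it is delivered
        return True

    def pending(self):
        """Return a copy of the messages still waiting for delivery."""
        return list(self._pending)


class FakeLink:
    """Stand-in for a network link that can be switched on and off."""

    def __init__(self):
        self.connected = False
        self.received = []

    def send(self, message):
        if not self.connected:
            raise ConnectionError("no network coverage")
        self.received.append(message)
        return True
```

While `FakeLink.connected` is False, submitted messages stay in the outbox; a later call to `flush()` after reconnection delivers them in their original order. A message is removed only after the transport has confirmed it, so a lost confirmation can lead to a duplicate transmission that the receiver would have to tolerate.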

Hard state

For high levels of reliability, recovery techniques based on hard (persistent) storage are used. One approach is to divide the variables of an application into stable and volatile variables. The variables needed to recreate the state are held as stable variables. Another, very similar approach is to checkpoint the state periodically to hard storage. Checkpointing can occur after the completion of a module, at arbitrary points in the code or at certain time intervals. If checkpointing is done at the granularity of single variables and these are checkpointed after every change, the technique is analogous to the use of stable variables. Checkpointing can therefore be seen as the more general technique, which includes the techniques that use stable variables. For performance reasons, the state selected for checkpointing should be chosen carefully and kept to a minimum.

With checkpointing, the system can go back to a previous state. If the fault is transient, re-executing the actions might overcome the error (rescheduling might be used).

Checkpointing, or the use of stable variables, can be done on local storage or on remote storage. By storing state in storage on the network, the system can recover from errors that would otherwise be unrecoverable. A hand held computer is more exposed to theft, to being dropped and to being destroyed than a stationary computer is. As mentioned, the state of a hand held computer can be saved to storage on the fixed network. The downside of this scheme is that it consumes bandwidth and cannot function during disconnections.

Sometimes state is spread among a number of participants. In this case, each process in the distributed application must checkpoint its state, and the collected checkpoints form a global checkpoint [Pradhan 1996]. The participants can either coordinate the checkpointing (for example by time synchronization) or checkpoint freely. When checkpointing is uncoordinated, some coordination must happen during recovery to select a correct set of checkpoints.
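Checkpointing to local hard storage, as described above, can be sketched in a few lines. The sketch assumes that the state fits in a small JSON document and that the checkpoint is taken after each completed unit of work. The function names `save_checkpoint`, `load_checkpoint` and `process_forms`, and the layout of the state, are hypothetical choices made for this illustration only.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` so that a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # force the data to stable storage
        os.replace(temp_path, path)    # atomic switch to the new checkpoint
    except BaseException:
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if none exists yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default


def process_forms(path, forms):
    """Process forms in order, checkpointing after each completed one."""
    state = load_checkpoint(path, {"next_index": 0, "done": []})
    while state["next_index"] < len(forms):
        form = forms[state["next_index"]]
        state["done"].append(form.upper())  # stand-in for the real work
        state["next_index"] += 1
        save_checkpoint(path, state)
    return state["done"]
```

If the process crashes and `process_forms` is called again with the same path, it resumes after the last form whose checkpoint reached the disk. Work done after the last checkpoint is re-executed, so under this scheme the actions between two checkpoints should be safe to repeat. Only the index and the results are checkpointed here, in line with the remark that the checkpointed state should be kept to a minimum.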

Bibliography

[AirAcesss 1994] AirAccess 2.0 Mobile Networking Software, AirSoft Inc., 1994. Uses data compression and differential file transfer.
[Alonso 1995] Exotica/FMDC: Handling Disconnected Clients in a Workflow Management System, Alonso G., Gunthor R., Kamath M., El Abbadi A., Mohan C., Proc. 3rd Int'l Conf. on Cooperative Information Systems, Vienna, May.
[Balakrishnan 1997] A Comparison of Mechanisms for Improving TCP Performance over Wireless Links, Balakrishnan H., Padmanabhan V. N., Seshan S. and Katz R. H., IEEE/ACM Transactions on Networking, 5(6), December.
[Bartlett 1994] W4 - the Wireless World-Wide Web, Bartlett J., in Proc. of the Workshop on Mobile Computing Systems and Applications, Santa Cruz.
[Bartlett 1995] Experience with a Wireless World Wide Web Client, Bartlett J., WRL Technical Note TN-46, Digital Western Research Laboratory.
[Brewer 1998] A Network Architecture for Heterogeneous Mobile Computing, Brewer E., Katz R., Chawathe Y., Gribble S., Hodes T., Nguyen G., Stemm M., Henderson T., Amir E., Balakrishnan H., Fox A., Padmanabhan V. and Seshan S., IEEE Personal Communications, 5(5):8-24.
[Brown 1995] DeckScape: An Experimental Web Browser, M. H. Brown and R. A. Schillner, Tech. Rep. 135a, Digital Equipment Corporation Systems Research Center, Mar.
[CITRIX 2001]
[Chang 1998] Web Browsing in a Wireless Environment: Disconnected and Asynchronous Operation in ARTour Web Express, Chang H., Tait C., Cohen N., Shapiro M., Mastrianni S., Floyd R., Housel B., Lindquist D., Proceedings of the Third Annual ACM/IEEE International Conference on Mobile Computing and Networking, September 26-30, 1997, Budapest, Hungary. December.
[Coppieters 1995] A Cross-Platform Binary Diff, Coppieters K., Dr. Dobb's Journal, May.
[Douglas 1995] BAYOU: Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System, Douglas B. Terry et al., Proceedings of the Fifteenth ACM Symposium on Operating System Principles, December 1995.
[Edwards 1997] Designing and Implementing Asynchronous Collaborative Applications with Bayou, W. K. Edwards, E. D. Mynatt, K. Petersen, M. J. Spreitzer, D. B. Terry and M. M. Theimer, Proceedings of the Tenth ACM Symposium on User Interface Software and Technology (UIST), Banff, Alberta, Canada, October 1997.
[Forman 1999] Wanted: Programming Support for Ensuring Responsiveness Despite Resource Variability and Volatility, Forman G. H., p. 286 in the book Mobility: Processes, Computers and Agents, Milojicic D., Douglis F., Wheeler R., ACM Press.
[Fox 1997] Cluster-Based Scalable Network Services, Fox A., Gribble S. D., Chawathe Y., Brewer E. A. and Gauthier P., Proceedings of the 16th Symposium on Operating Systems Principles, December.
[Fox 1998] Adapting to Network and Client Variation Using Active Proxies: Lessons and Perspectives, IEEE Personal Communications, 5(4):10-19, August.
[Gray 1991] High-Availability Computer Systems, Gray J., Siewiorek D. P., IEEE Computer, 24(9):39-48.
[Gray 1996] The Dangers of Replication and a Solution, Gray J., Helland P., O'Neil P. and Shasha D., in ACM SIGMOD Conference.
[Housel 1996] WebExpress: A System for Optimizing Web Browsing in a Wireless Environment, Housel B. C., Lindquist D. B., Proceedings of the Second Annual International Conference on Mobile Computing and Networking, November.
[Howard 1988] Scale and Performance in a Distributed File System (Andrew File System), Howard et al., ACM Transactions on Computer Systems, 6(1), February 1988.

[Huang 1995] Software Fault Tolerance in the Application Layer, Yennun Huang and Chandra Kintala, AT&T Bell Laboratories, Chapter 10, 1995, John Wiley & Sons Ltd.
[Jing 1999] Client-Server Computing in Mobile Environments, Jing J., Helal A. and Elmagarmid A. K., ACM Computing Surveys, June.
[Joseph 1996] Building Reliable Mobile-Aware Applications using the Rover Toolkit, Joseph A. D., Tauber J. A., Kaashoek M. F., Second ACM International Conference on Mobile Computing and Networking (MobiCom 96).
[Joseph 1997] Mobile Computing with the Rover Toolkit, Joseph A. D., Tauber J. A., Kaashoek M. F., IEEE Transactions on Computers, 46(3), March.
[Kaashoek 1995] Dynamic Document: Mobile Wireless Access to the WWW, M. Frans Kaashoek, Proc. IEEE Workshop on Mobile Computing and Applications, December. (Mosaic)
[Kistler 1992] Disconnected Operation in the Coda File System, James J. Kistler and M. Satyanarayanan, ACM Transactions on Computer Systems, 10(1):3-25, February.
[Klingemann 1997] Enabling Cooperation among Disconnected Mobile Users, Klingemann J., Tesch T., Wasch J., Proceedings of the 2nd IFCIS International Conference on Cooperative Information Systems (CoopIS 97), June.
[Kunz 1999] An Architecture for Adaptive Mobile Applications, Thomas Kunz and James P. Black, System and Computer Engineering, Carleton University. Proceedings of Wireless 99, the 11th International Conference on Wireless Communications, Calgary, Alberta, Canada, July.
[Lyu 1995] Software Fault Tolerance, Michael R. Lyu, editor, John Wiley & Sons.
[Mummert 1995] Exploiting Weak Connectivity for Mobile File Access, L. B. Mummert, M. R. Ebling, M. Satyanarayanan, Proceedings of the 15th ACM Symposium on Operating System Principles, December.
[Niemeyer 1997] Exploring Java, Niemeyer P., Peck J., 2nd Edition, September, O'Reilly & Associates, Inc.
[Noble 1999] Experience with Adaptive Mobile Applications in Odyssey, Noble B. D., Satyanarayanan M., Mobile Networks and Applications 4 (1999).

[Oracle 2001a] Oracle Mobile Agents Writing Applications, Chapter 2 Architecture Overview, Release 3.0 A
[Oracle 2001b] Oracle8i Lite Developer's Roadmap.
[Petersen 1997] Flexible Update Propagation for Weakly Consistent Replication, K. Petersen, M. J. Spreitzer, D. B. Terry, M. M. Theimer and A. J. Demers, Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP-16), Saint Malo, France, October 5-8, 1997.
[Pradhan 1996] Recovery in Mobile Wireless Environment: Design and Trade-off Analysis, D. K. Pradhan, P. P. Krishna and N. H. Vaidya, in Proceedings of the 26th IEEE International Symposium on Fault-Tolerance Computing, 7(10), October.
[Saltzer 1994] End-To-End Arguments in System Design, Saltzer J. H., Reed D. P., Clark D. D., ACM Transactions on Computer Systems, 2(4), November.
[Satyanarayanan 1993] Coda: Experience with Disconnected Operation in a Mobile Environment, M. Satyanarayanan, J. Kistler, L. Mummert, M. Ebling, P. Kumar and Q. Lu, Proceedings of the USENIX Symposium on Mobile and Location Independent Computing, August.
[Satyanarayanan 1996] Fundamental Challenges in Mobile Computing, Satyanarayanan M., Proceedings of the Fifteenth Annual ACM Symposium on Principles of Distributed Computing, Philadelphia, PA, May.
[Seybold 2001] Mobile Transport, Seybold A. M., whitepaper made by Newtech Systems. The document describes a product called smartip. The system intercepts TCP/IP connections by loopback to a proxy server on the client and uses its own proprietary protocol to communicate over a wireless network. http : //nettechrf.com/about w ite s eybold.shtml
[Siegel 1999] CORBA 3 Fundamentals and Programming, Siegel J., 1999, 2nd Ed., John Wiley and Sons.
[Steinke 95] Middleware Meets the Network, Steinke S., LAN: The Network Solutions Magazine 10, 13 (December 1995):56.
[Tanenbaum 1995] Distributed Operating Systems, Tanenbaum, Andrew S., Prentice-Hall International (UK), 1995.

[Tanenbaum 1996] Computer Networks, Tanenbaum, Andrew S., 3rd ed., 1996, Prentice-Hall.
[Terry 1998] The Case for Non-transparent Replication: Examples from Bayou, D. B. Terry, K. Petersen, M. J. Spreitzer and M. M. Theimer, IEEE Data Engineering, December 1998.
[Torres-Pomales 2000] Software Fault Tolerance: A Tutorial, Wilfredo Torres-Pomales, NASA/TM, Langley Research Center, Hampton, Virginia, October 2000.