Measurement-Based Modeling of Distributed Systems. Meßbasierte Modellierung verteilter Systeme


Measurement-Based Modeling of Distributed Systems (Meßbasierte Modellierung verteilter Systeme). Submitted to the Faculty of Engineering (Technische Fakultät) of the University of Erlangen-Nürnberg to obtain the degree of DOKTOR-INGENIEUR by Kai-Steffen Jens Hielscher, Erlangen

Approved as a dissertation by the Faculty of Engineering (Technische Fakultät) of the University of Erlangen-Nürnberg. Date of submission: 12 March 2008. Date of doctoral examination: 21 April 2008. Dean: Prof. Dr.-Ing. habil. Johannes Huber. Reviewers: Prof. Dr.-Ing. Reinhard German, Prof. Dr.-Ing. Wolfgang Schröder-Preikschat

Contents

List of Figures
List of Tables
Abstract
Zusammenfassung
1 Introduction
2 Related Work: Measurements, Time Synchronization, Input Modeling, Performance Evaluation of Web Servers
3 The Web Cluster Laboratory: The Linux Virtual Server System, Hardware Setup
4 Measurement Concepts: Computer Clocks, Clock Errors, Classification in the Frequency Domain, Classification in the Time Domain, Reference Clocks, NTP, Time Sources, The PPS API
5 Dedicated Measurement Infrastructure: PPS Pulse Latency, Echo Feedback, Offline Synchronization, Instrumentation, IP Stack Instrumentation, Web Server Instrumentation, Load Generator Instrumentation, Application Server Instrumentation, Summary Performance Data, Analysis of the Traces, Example Measurement Results
6 Advanced Input Modeling: Traces and Empirical Distributions, Outlier Values, Autocorrelation, Standard Theoretical Distributions, Multimodal Distributions, Multimodal Distributions with Phases, Bézier Distributions, A New Model for Autocorrelated Data
7 Simulation Model: Model Structure, TCP, RFC, RFC, RFC, RFC, RFC, Client, Application, TCP, Processor, Network Channels, Load Balancer, 7.6 Servers, Processes, System Processes, Processor, Utility Classes and Execution Control, Experiments
8 Conclusions and Future Work
Bibliography

List of Figures

3.1 Distributed Web Server Architecture Load Balancing via NAT Hardware Monitoring Software Monitoring Hybrid Monitoring Latencies for Reading the Time Frequency Changes with Temperature Frequency Variation Frequency Distribution Phase Errors UDP Delays Power-Law Spectral Densities Allan Deviation NTP Time Transfer NTP Architecture NTP and the PPS API Detail of UDP Delays Synchronization System Interrupt Latencies External Clock Time Deviation Offline Synchronization IP Stack Instrumentation Application Server Instrumentation Architecture Illustration of Delays in the Object System Trace Plot of Measured Delays Trace Plots of Individual Delays 5.12 Delay Components for Requests Summary Statistics the Delays Histograms of Observed Delays Correlation Plots (lag 500) Correlation Plots (lag 40) Trace Plots Sorted by Real Server Distribution Comparison for Delay Distribution Comparison for Delay State Chart for Phase Transitions Distribution Comparison for Delay Screenshot of PRIME Distribution Comparison for Delay Histogram H o of the Deltas for Delay Trace Plot of Delay Delta over the Values of Delay D Histogram of Delta Weighting Areas Weighting Factors Original and Weighted Histogram for Delta Distribution Comparison for Delay Conceptual Model TCP Model of a TCP Segment Central TCP State Chart receive_packet Structure of the Client Object Conceptual Model of the Network Channels Server Model and Embedded Objects Graphical Comparison of the Results

List of Tables

4.1 Slope Characteristics Quantile Summary for Delays in Microseconds Fitted Standard Theoretical Distributions Fitted Multimodal Distributions Fitted Multimodal Distributions with Phases Core Simulation Parameters Quantile Comparisons in Milliseconds CPU Load Comparison

Abstract

Nowadays, distributed systems are ubiquitous. Since the processing delays in such systems are often essential, many research projects deal with performance analyses of these systems. Most of them treat the systems from an abstract point of view and build coarse-grained models that do not include the results of detailed measurement studies of real systems. The goal of this work is to demonstrate a methodology for creating precise models of distributed systems that are parametrized, calibrated and validated from fine-grained measurements of a laboratory setup. The approach is exemplified on a cluster-based web server system. The resulting model contains many details that influence the behavior and performance of the system, such as one-way delays or system activity caused by the hardware.

Since network aspects play a central role in distributed systems, it is important to capture the timing characteristics of packet delays in the network exactly. Therefore, a modular system has been developed for the Linux operating system that records sent and received TCP segments and generates timestamps for the sending and receiving actions. For that purpose, the netfilter framework has been extended to insert packet headers and corresponding timestamps into a ring buffer in kernel space. The timestamps in the resulting event trace are generated using the clock of the object system that is observed. To calculate one-way delays for packets, it is necessary to synchronize the clocks of the nodes, because the timestamps for the sending and the receiving event are taken from different clocks. This can be achieved with standard solutions like the use of NTP during the measurement. An alternative that is more suitable in many situations is a dedicated offline synchronization. This method was developed as part of this work and is based on an algorithm classically used for online synchronization of computer clocks. The method makes it possible to use the cycle counter of the processor (TSC) of the object system for timestamping. Due to this feature, a context switch for obtaining kernel clock timestamps can be avoided, so the latencies for reading the clocks are minimized.

To implement this solution, the PPS output of a GPS receiver is connected to the nodes of the object system that need to be synchronized. PPS signals are pulses that mark the beginning of every second. During the measurement, timestamps for these PPS pulses are recorded in an additional trace. The standardized interface for PPS pulse reception, the PPS API, was extended to also use the TSC for timestamp generation. The resulting time trace can be used after the measurement to calculate the offset and the frequency of each individual clock that has been used for timestamping in the event traces. Using this information, the traces can then be related to a global timing reference. Interrupt latencies occur during the generation of the time trace and have a negative effect on the accuracy of the synchronization. Therefore, a hardware module has been developed that measures the time between the PPS pulse and the invocation of the interrupt handler. This innovation allows a correction of the timestamps in the time trace. In addition to the fine-grained event recording, summary performance data are recorded to calibrate and validate the model.

In a second step, the obtained data are processed so that they can be represented in the model in a sensible way. For this purpose, the standard method of fitting theoretical distribution functions is used in the input model. It is supplemented by advanced techniques like multimodal and Bézier distributions. Because this is not sufficient for all data sets, distributions are combined so that the correlation of the measured data is represented using a phase approach. For some sets of delays, even this does not produce satisfactory results. Due to the buffering of the Ethernet frames in network elements, these delays feature both a high correlation over large lags and a strict upper and lower bound. To represent these values, a new procedure is introduced. It samples the differences of successive values from a part of an empirical distribution function. This part is selected according to the value the random variable has reached. Samples generated in this way agree well with the original values regarding both the density and the correlation structure.

The representations of the data are utilized in a detailed simulation model of the complete system. It contains the most important aspects of the web cluster that influence the performance. The model has been realized in AnyLogic, a simulation tool that is based on UML and Java. The model consists of five objects on the root level (client, network channel 1, load balancer, network channel 2 and server nodes). Some of the elements have a multiplicity, i.e. more than one instance of these objects is present. For each HTTP request, an individual TCP connection is simulated. TCP is the lowest level of the protocol stack that is modeled explicitly. An instance of the client object is created for each TCP connection. Besides an instance of a processor model object, each client contains an instance of a TCP object. This object models the TCP stack of the operating system and is a complex sub-model that reproduces the main features of the protocol. It controls the connection establishment and tear-down and the protocol dynamics, and it supports message segmentation. Modeled properties include slow start, congestion avoidance, timeout calculation, fast retransmit and fast recovery. Both network channels induce packet delays that represent the measured characteristics. The load balancing object forwards incoming connection requests to specific server nodes according to configurable scheduling algorithms. The server objects are responsible for handling the requests. The processing phases of concurrent requests of different TCP connections are interleaved at the server objects. For that purpose, each server contains an individual instance of the TCP object mentioned before for every TCP connection. The servicing is done in process objects. Processor time is assigned to them by a central processor object. The processing in user mode can be delayed by system activity. Confidence intervals are utilized for the execution control of the simulation. The resulting model allows a fine-grained evaluation of the behavior of the system. The time needed to reach a defined quality of the simulation results is acceptable.

Altogether, we created a solution that is easily applicable and makes it possible to obtain fine-grained measurement data from a laboratory setup of a system, to represent these data adequately with their densities and correlations, and to create a detailed simulation model that contains the essential features of the system and allows performance evaluations of different configurations.


Zusammenfassung (Summary)

Distributed systems are ubiquitous today, and the delay incurred when processing tasks in such systems is often an important quantity. For this reason, a variety of works deal with the performance evaluation of such systems. Most of these research projects address a coarse-grained modeling of the systems at an abstract level and mostly manage without detailed measurements of real systems. The goal of the present work is therefore to demonstrate a methodology with which a precise model of distributed systems can be created that is parametrized, calibrated and validated by fine-grained measurements on a laboratory system. This procedure is illustrated using the example of a cluster-based web server system. The resulting model contains many details that influence the behavior and performance of the system, such as one-way delays in the network and system activity caused by the hardware.

Since network aspects are of central importance in distributed systems, it is important to be able to capture the timing characteristics of packet transit times in the network precisely. Therefore, a modular system for Linux was developed that supports the recording and timestamping of sent and received TCP segments. For this purpose, the netfilter framework was extended so that packet headers and the associated timestamps can be entered into a ring buffer in the address space of the operating system kernel. The timestamps in the resulting event trace are generated with the clock of the respective observed object system. To determine one-way delays for packets from them, the clocks of the object system nodes must be synchronized, because the timestamps for the send and receive events are obtained from different clocks. This can be done during the measurement using well-known techniques such as NTP, or after the measurement in an offline synchronization process developed specifically for this purpose. With the help of this method, timestamping can also be performed with the cycle counter of the processor (TSC) of the object system, so that the context switch that is necessary in some cases when reading the clock can be avoided.

For synchronization, the PPS output of a GPS receiver is connected to the nodes of the object system that are to be synchronized. PPS pulses are standardized signals that are generated exactly at the beginning of every second. While events are being recorded, an additional trace with timestamps for the PPS pulses is logged. From this time trace, the offset and the frequency of the clock used for timestamping the events can be determined afterwards. With this information, the traces can then be related to a global time reference. The interrupt latencies that occur while the time trace is generated have a negative effect on the synchronization. Therefore, a circuit was developed that makes it possible to determine, for every PPS signal, the time between the pulse and the invocation of the interrupt handler. This time then allows the timestamps in the time trace to be corrected. In addition to the fine-grained event recording, summary performance data are collected for calibrating and validating the model.

In a second step, the measurement data obtained are processed so that they can be represented sensibly in the model. For this purpose, well-known methods of input modeling using theoretical distribution functions are employed. They are supplemented by advanced techniques such as multimodal and Bézier distributions, but this does not lead to the desired result for all measured values. Therefore, the distribution functions are combined by means of phases so that the autocorrelation of the measurement data can also be represented. For certain measured delays, this procedure is not successful either, because the buffering of Ethernet frames in the network components causes them to exhibit high autocorrelation over large lags and a fixed upper and lower bound. To represent them, a new method was developed in which the differences of successive values are generated from a part of an empirical distribution function. The corresponding part of the distribution function is selected according to the value that the actual random variable has already reached. The values generated in this way agree well with the original data both in their density and in their correlation structure.

The representations of the data are used in a detailed simulation model of the complete system. It reproduces essential aspects of the web cluster that influence the performance. The model was realized in AnyLogic, a simulation tool based on UML and Java. At the top level, the model consists of five objects (client, network channel 1, load balancing node, network channel 2 and server nodes), some of which have a multiplicity, i.e. are present multiple times. For each HTTP request, a separate TCP connection is simulated. TCP is the lowest level of the protocol stack that is modeled explicitly. A separate instance of the client object is created for each TCP connection. Besides a processor instance, an instance of a TCP object is embedded in the client. It models the TCP stack of the operating system and is a complex sub-model that reproduces essential properties of the protocol. It controls connection establishment and teardown and the dynamics of the protocol, and it supports the segmentation of messages. Modeled properties include slow start, congestion avoidance, timeout calculation, fast retransmit and fast recovery. The two network channels generate packet delays corresponding to the measured data. The load balancing object assigns incoming connection requests to the server nodes according to configurable scheduling strategies. These server objects then process several requests; the individual processing phases of different TCP connections are interleaved. For this purpose, a server has one instance of the previously mentioned TCP object for each connection. Processing takes place in process objects to which processor time is allocated. The individual processing phases can be delayed by system processes. Confidence intervals are used for simulation control. The resulting model makes it possible to reproduce the system behavior at a fine granularity, while the time required to reach a given quality of the simulation results remains acceptable.

Thus, an easily applicable solution was created that makes it possible to collect fine-grained measurement data on a laboratory system, to represent these data adequately with their densities and autocorrelations, and to create a detailed simulation model that contains essential system aspects and yields statements about the performance of different configurations.


1 Introduction

During the last ten years, the Internet has become a major economic factor. Electronic business has replaced traditional mail order in many areas. More and more new business models utilizing the Internet emerge. Individuals can start with new ideas to create solutions without the need for excessive funding. One example of a new platform can be found in [92], where the economic opportunities of an Internet portal for customer referral programs were evaluated with simulation models.

The number of persons using the Internet is also constantly growing. According to statistics of the Miniwatts Marketing Group [60], worldwide Internet usage grew by more than 265% from the year 2000 to the end of 2007. The largest growth can be observed in the Middle East, with a rate of over 920% in the same period. A growing percentage of these users access the Internet over broadband connections. This allows service providers to offer innovative services like Voice-over-IP or IP TV.

Due to these trends, the need for high-performance server systems to fulfill the demands of a growing user base is rising. The success of open source operating systems and the availability of powerful PC hardware at low cost make it possible to build cost-effective solutions to these challenges by combining commodity server hardware with load balancing mechanisms into high-performance cluster-based web servers.

Customer satisfaction depends mainly on the availability and speed of a service. A method to evaluate the expected delay for user transactions in early design phases of a server architecture helps to dimension the system. As both the hardware and the application have a large impact on the performance, an approach that bases the modeling on measurements of individual components promises more exact results than a simpler model like the common queuing network models, which often assume Markovian traffic to allow fast and easy evaluation.

When planning measurements, it is important to keep an eye on the model in which the results are to be used. On the other hand, when building a model, it is equally important to be able to parametrize the model according to real-world data. For this reason, we designed and implemented a solution that makes it possible to measure the most important performance data of distributed systems, to represent them in the model and to simulate the complete system at a great level of detail. The simulation model not only makes it possible to assess the performance of different architectures under various workload conditions, it also helps to understand the influence of operating system aspects like interrupt handling and scheduling.

The object of study is a cluster-based web server that has been installed in our laboratory. It allows us to observe a live system under realistic load conditions. As we have full control over the system, we can change its architecture, modify the operating system and generate load with different characteristics. The application can also be changed from serving static pages with a simple Apache web server to a multi-tier system that implements a book shop with web servers, application servers, databases and even an emulation of the credit card authorization. This flexibility provided an excellent base for experiments both in the measurement and in the modeling phase.

Our measurement infrastructure is based on GPS, which offers a common time base around the globe and therefore supports measurements of one-way delays over the long-distance paths that often occur in distributed systems on the Internet. Processing the timing information with dedicated hardware and a sophisticated offline time synchronization process mitigates the side effects of noise sources like interrupt latency and thermal effects in the time synchronization.
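The idea of relating locally timestamped events to a global reference after the measurement can be illustrated with a minimal sketch. It is not the offline synchronization algorithm of this thesis: it assumes a single constant clock frequency over the whole trace (so thermal drift is ignored), one recorded counter value per consecutive PPS pulse with no missed pulses, and an optional per-pulse interrupt latency given in counter ticks. The function names `fit_clock` and `to_reference_seconds` are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_clock(pps_counters: Sequence[float],
              latencies: Optional[Sequence[float]] = None) -> Tuple[float, float]:
    """Least-squares fit of counter = offset + frequency * k for PPS pulse k.

    pps_counters: counter value (e.g. TSC) recorded for pulse k = 0, 1, 2, ...
    latencies:    measured interrupt latency of each pulse in counter ticks
                  (assumed to be available; zero if omitted).
    Returns (offset, frequency): counter value at pulse 0 and ticks per second.
    """
    n = len(pps_counters)
    if n < 2:
        raise ValueError("need at least two PPS timestamps")
    if latencies is None:
        latencies = [0.0] * n
    # Remove the individual interrupt latency from every pulse timestamp.
    corrected = [c - l for c, l in zip(pps_counters, latencies)]
    mean_k = (n - 1) / 2.0
    mean_c = sum(corrected) / n
    sxx = sum((k - mean_k) ** 2 for k in range(n))
    sxy = sum((k - mean_k) * (c - mean_c) for k, c in enumerate(corrected))
    frequency = sxy / sxx
    offset = mean_c - frequency * mean_k
    return offset, frequency


def to_reference_seconds(counter: float, offset: float, frequency: float) -> float:
    """Map an event counter value to seconds elapsed since PPS pulse 0."""
    return (counter - offset) / frequency
```

If two nodes record the same PPS pulses, a one-way delay is the difference between the receive event mapped with the receiver's fit and the send event mapped with the sender's fit. A piecewise fit over short windows would be one way to follow frequency changes, which this sketch does not attempt.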
A configurable modular instrumentation of the TCP/IP stack of the Linux operating system helps to gather a high volume of data without significant degradation of the performance of the system.

Further, emphasis is put on applying advanced input modeling techniques in order to adequately represent the basic parameters in the model. Special care has been taken to reflect autocorrelation in the input data.

A simulation model based on UML has been built. It shows how to represent mechanisms like queuing in buffers, transport control mechanisms and contention for CPU power with other processes. The resulting model thus combines a precise stochastic representation of low-level system parameters with an explicit representation of system behavior at higher observable levels. The simulation makes it possible to gather performance data of various configurations under different load characteristics without additional measurements and input modeling.

We illustrated our approach on the basis of a cluster-based web server architecture, but most aspects are not limited to this field of application. Even the most system-related tasks of the measurement concepts have been demonstrated to be applicable in various environments, ranging from web portals and wireless local area network transmissions to mobile embedded systems on soccer robots. The measurement studies in [29, 39, 40, 72, 73, 75, 76] are based on the work presented here.

Chapter 2 presents related work in the context of our fields of research. It is followed by a brief description of the laboratory setup in chapter 3. Chapter 4 illustrates the basic concepts of performance measurements and shows the problems that have to be dealt with during measurement studies. In chapter 5, we present our solution for detailed, fine-grained measurements of distributed systems. Besides the instrumentation of the system, this also includes two approaches to improve the quality of the time synchronization needed in software monitoring: Echo Feedback and Offline Synchronization. Chapter 6 illustrates how the measured data can be represented in a performance model of the system while preserving the most important statistical parameters. A simulation model of the web cluster system based on UML is presented in chapter 7. It includes various details that influence the dynamics and performance of the system. These details are typically not found in classic queuing models of such systems. Chapter 8 concludes the work and gives directions for future research in this area.


2 Related Work

Since the work presented here touches different fields of research, there exist numerous related publications, and only some of the most influential ones are mentioned in the following sections.

2.1 Measurements

Various approaches for performance measurements are presented in [36] by Raj Jain and in [38] by Klar et al. The second book also demonstrates the application of hardware and hybrid monitoring for different distributed systems as they have been implemented at the Department of Computer Science 7 (Computer Networks and Communication Systems) of the University of Erlangen-Nürnberg. The hardware monitor ZM4 was built and utilized for these projects.

An extensive configurable instrumentation of the Linux kernel is the Linux Trace Toolkit (LTT) [100]. The system operates efficiently and provides valuable information, but the level of detail provided makes it hard to filter relevant information. Furthermore, it is implemented as a kernel patch and is thus not easily adaptable to different kernel versions. Due to the extensive instrumentation, this solution is more intrusive and causes more measurement overhead than an instrumentation that is specifically tailored to the observed system. New versions of the system are called LTTng [21]. Its applicability to distributed systems has been demonstrated in [95]. Our implementation of the IP stack instrumentation affects only some parts of the network packet processing and can be configured to be applied only to packets of interest. Therefore, the measurement overhead is greatly reduced.

Marcus Meyerhöfer's PhD thesis [55] presents a comprehensive performance measurement solution. It was implemented at the Department of Computer Science 6 (Data Management) of the University of Erlangen and serves similar purposes as our AOP-based monitoring of the J2EE application server. Compared to his work, our AOP-based instrumentation is based on standard techniques and, depending on the extent of the instrumentation, is expected to cause less overhead. Nonetheless, more manual effort has to be put into the instrumentation when applying our approach.

2.2 Time Synchronization

The most important solution for computer clock synchronization, which also influenced the work presented here, is the Network Time Protocol (NTP) [59]. It is intended for time synchronization over the Internet using a hybrid approach based on phase-locked and frequency-locked loops. It also marks the state of the art in this field. We used its algorithms and concepts as a basis for our own implementations and extensions as presented in chapter 5 of this thesis. Its foundations are explained in more detail in a later section.

The National Institute of Standards and Technology also published a number of algorithms to synchronize clocks to a common time base. We used one of these, the lockclock algorithm [46] by Judah Levine, as a basis for our idea of an offline synchronization solution presented in section 5.2.

Some of the more recent research projects concentrate on estimating clock differences and one-way delays from statistical properties of delays measured using unsynchronized clocks [63, 66, 67]. A similar approach for offline time synchronization without a reference clock has previously been published in [31] as the result of research at the Department of Computer Science 7. All these methods are intended for use with Internet packet delays on the order of milliseconds. As our evaluation showed [19], these methods are not applicable for determining exact distribution functions for the one-way delays occurring in our laboratory setup, which are only several microseconds long.

Another solution for time synchronization is the IEEE standard 1588 [34]. It is intended for synchronization of measurement and control systems on a local area network in the sub-microsecond range.
Although this would be an optimal choice for our measurement infrastructure, it can only be implemented using specialized hardware or real-time operating systems. It mainly focuses on Ethernet architectures, but can be used with other LAN technologies, too. Wide-area synchronization over the Internet is not possible using this standard.

Our solution is based on the PPS API [62], which is intended for connecting an external reference time source to one NTP server. We extended the existing solution by distributing one PPS signal to all nodes of our object system. The existing echo feature of the PPS API is intended to measure the latencies involved and to use a mean value of this latency for compensation. Our improvement uses this facility to measure the individual latency of every interrupt handler invocation and to correct the timestamps dynamically.

2.3 Input Modeling

Law and Kelton [45] summarize the most important methods for input modeling. Our input modeling with phases is an adaptation of the process for multimodal distributions presented in the book. We also combined this approach with the methods for Bézier distribution functions of Wilson and Wagner [91, 90] and used their tool PRIME for the construction of the curves.

Our approach for representing correlated data shares some similarities with time series approaches as they are used in the TES methodology by Melamed [50] and the ARTA processes by Cario and Nelson [13], but the specific nature of the buffering effects made it necessary to find a different way to represent the data. The sampling of the differences of successive values is reminiscent of the classical time series approach, but the construction of an empirical distribution function and the sampling of the differences from parts of this distribution according to additional constraints for an upper and lower bound is a novel aspect of our work.

Markovian arrival processes (MAPs) are also able to capture the correlation structures of input data to some extent, but compact forms are insufficient to generate autocorrelation over long lags, and they are more suited to analytical models as they are used in traffic-based decomposition. One example of their application can be found in [27].

2.4 Performance Evaluation of Web Servers

A huge number of scientific papers have been published on the topic of performance evaluation of distributed systems. Some of the well-known analysis approaches for web server systems are published in the books of Menascé [54, 51, 52, 53]. They allow analytic solutions based on simple queuing networks with classes.
The workload of the users is mapped to service demands at the different components of the system. The user behavior is represented in a customer behavior model graph (CBMG). This makes it possible to define different ways in which a user can access the system, which lead to different service demands at different nodes.

More detailed models of web clusters are included in [14]. Their simulation study is based on a detailed model of the hardware of the system, but does not include fine-grained measurements of delays inside the system. The authors of [82] investigate the effect of different load balancing strategies on cluster-based web servers using both a laboratory setup of a cluster and a simulation model. However, they only employ high-level measurements of the load balancer and server service times. Packet delays and network dynamics are not in their focus. In [102], the performance of several load balancing schemes is evaluated. They use traces of the arrival processes as input for their simulation model, but other aspects of the overall system performance are not based on measurements.

Regarding the simulation of TCP, there exist some similarities with the TCP model that is included in the INET framework for OMNeT++ [86]. Even though the level of detail included in that model is still larger than in our simulation, the focus of this tool is more on functional evaluation than on the sound statistical analyses that are needed for serious performance evaluations. Even more functionality has been included in an integration of the complete TCP/IP stack of FreeBSD into OMNeT++ [8].
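The difference-sampling idea outlined in section 2.3 can be sketched as follows. This is a simplified interpretation under stated assumptions, not the procedure developed in the thesis: the value range of a measured trace is split into equally wide areas (the number `bins` is a free parameter), the observed differences of successive values are collected per area, and each new difference is drawn from the area that contains the current value. Clamping to the observed minimum and maximum stands in for the strict bounds. All names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Model = Tuple[float, float, float, List[List[float]]]


def build_delta_model(trace: Sequence[float], bins: int) -> Model:
    """Collect the differences of successive values per area of the value range."""
    lo, hi = min(trace), max(trace)
    width = (hi - lo) / bins or 1.0  # avoid zero width for a constant trace
    deltas: List[List[float]] = [[] for _ in range(bins)]
    for cur, nxt in zip(trace, trace[1:]):
        idx = min(int((cur - lo) / width), bins - 1)
        deltas[idx].append(nxt - cur)
    return lo, hi, width, deltas


def _nearest_pool(deltas: List[List[float]], idx: int) -> List[float]:
    """Return the delta list of the closest area that has observations."""
    for dist in range(len(deltas)):
        for j in (idx - dist, idx + dist):
            if 0 <= j < len(deltas) and deltas[j]:
                return deltas[j]
    raise ValueError("trace too short to build a model")


def generate(model: Model, n: int, rng: random.Random) -> List[float]:
    """Generate n values by sampling differences that depend on the current value."""
    lo, hi, width, deltas = model
    value = lo
    out = []
    for _ in range(n):
        idx = min(max(int((value - lo) / width), 0), len(deltas) - 1)
        step = rng.choice(_nearest_pool(deltas, idx))
        value = min(max(value + step, lo), hi)  # enforce lower and upper bound
        out.append(value)
    return out
```

Because every step depends on the value reached so far, successive samples stay close to each other, which is what produces correlation over several lags, while the clamping keeps all samples inside the observed range.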

3 The Web Cluster Laboratory

To evaluate the performance of distributed web servers, we built a laboratory setup of a cluster-based web server [28]. Distributed web servers need at least one load balancing node that distributes incoming user requests to several nodes that process the requests with common web server software like the Apache web server. These nodes are called real servers. Load can either be generated by real clients on the Internet or by a load generator that creates synthetic load and is often located in an internal network. Figure 3.1 shows the basic architecture of a distributed web server with its components.

Figure 3.1: Distributed Web Server Architecture

The most common approach to load balancing is the DNS-based mechanism, where the host name of the server is resolved to different IP addresses belonging to different machines according to a specified scheduling algorithm. The drawback of this method, known as round-robin DNS, is that the time-to-live entry for the DNS record must be small to avoid asymmetrically balanced load. For this reason, the entry is only cached for a short time and frequent name resolution processes are needed [12]. The main field of application is therefore global load balancing. Its goal is to distribute load originating in specific geographical regions to a nearby web server so that the distance and the delay in the network are minimized. The system uses a table of IP address blocks and geographic locations to resolve the alphanumeric host name of the server to different IP addresses according to the client location. These addresses belong to different web server machines located on the Internet in different geographic regions.

3.1 The Linux Virtual Server System

In our solution, we use a routing-based approach that is better suited to local load balancing, where all servers are located in geographical proximity. The Linux Virtual Server [48] system is an open source project that supports load balancing of various IP-based services as well as out-of-band transmission (e.g. for FTP) and persistent connections (e.g. for SSL). It is a layer-4 switching system in which routing decisions are based on fields of the TCP or UDP headers such as port numbers. The whole distributed web server carries a single IP address called the Virtual IP Address (VIP). Requests sent to this address are balanced among the real servers, which carry the different Real IP Addresses (RIP_i). Three mechanisms for load balancing are available: Network Address Translation, IP Tunneling and Direct Routing.

Network Address Translation (NAT) is a method specified in RFC 1631 [22] for mapping a group of n IP addresses with their TCP/UDP ports to a group of m different IP addresses (n-to-m NAT). When used for load balancing, the VIP is assigned to the load balancer only.
This node receives all incoming packets, selects the IP address of a real server according to a chosen scheduling algorithm, creates an entry in a connection table, changes the destination address of the packet to the chosen RIP_i and forwards it to the selected real server. The connection table is used to route packets of the same client session (i.e. TCP connection) to the same real server and the answer packets back to the right client. The load balancer is used as the standard gateway for the real servers in their routing tables. When packets belonging to replies arrive at the load balancer, the source address is changed to the VIP and the packets are forwarded to the client via the Internet. NAT involves rewriting both the packets directed to the real server nodes and those originating from them. As the load balancer has to be used as a gateway for the real server nodes, its use is reasonable only for nodes in geographical proximity. Figure 3.2 exemplifies the functionality of this approach.

Figure 3.2: Load Balancing via NAT

Tunneling and Direct Routing cause less overhead because the packets sent by the real servers do not have to pass the load balancer. Since our load balancer does not reach saturation even with the NAT approach, we did most of our measurements with NAT. Details about the other two methods can be found in [48].

The Linux Virtual Server system offers different scheduling algorithms: Round Robin, Weighted Round Robin, Least Connection, Weighted Least Connection, Locality-based Least Connection, Locality-based Least Connection with Replication, Destination Hashing and Source Hashing scheduling. While the first four algorithms can be used for any IP-based service, the latter four are intended for cluster-based caching proxy servers.

The system is implemented as a Linux kernel patch that is integrated into the netfilter framework. This framework is used for the manipulation of IP packets for firewalling and NAT. The kernel part can be configured using the user mode tool ipvsadm. Only the load balancer needs to run the Linux operating system; the real servers can operate under any OS that supports the necessary features like IP-IP encapsulation for Tunneling or non-arping interfaces for Direct Routing. In addition to monitoring the state of the real servers and removing them from the scheduling in case of an error, there are different software add-ons that can be used to implement a fail-over solution for the load balancer for high availability [48].

An identical configuration of all machines simplified the laboratory setup. Therefore, we used Linux with a 2.4.x kernel version on all machines for our measurements. While other operating systems can create non-ARP interfaces without any modification, a special hidden patch for Linux [4] is needed for the real servers with Direct Routing. Most measurements were done serving static content with the Apache web server. The simulation model presented in chapter 7 also implements this configuration.

3.2 Hardware Setup

The hardware we use in our project consists of one load balancer with the following main components:

- SMP mainboard with ServerWorks ServerSet III LE chipset with 64-bit PCI bus,
- two Intel Pentium III processors with 1 GHz each,
- 512 MB SD-RAM PC133 memory,
- two 1000-Base-SX network interface cards with Alteon AceNIC chipset with 64-bit PCI interface,
- on-board 100-Base-TX NIC with Intel chipset for management purposes.

The same hardware setup is utilized for the load generator. We used up to ten real servers and one NTP server with identical hardware:

- mainboard with VIA Apollo KT133 chipset (VT8363A north bridge and VT82C686B south bridge),
- AMD Athlon Thunderbird processor with 900 MHz,
- 256 MB SD-RAM PC133 memory,
- two 3Com 100-Base-TX PCI network interface cards.

A 24-port Cisco Catalyst 3500XL switch with two 1000-Base-SX GBIC modules connects the load generator, the load balancer and the real servers. It supports the use of SNMP and RMON for monitoring the switch internals. The 100-Base-TX NICs used for management purposes are connected to another switch to minimize the influence of management traffic on our measurements. The Gigabit Ethernet ports are connected to the load generator and the load balancer, whereas the real servers are connected to the Fast Ethernet ports of the switch.


4 Measurement Concepts

One method for assessing the performance of computer systems is to conduct measurements. Although a real implementation of the system is needed, typically in a laboratory setup, only this method yields real-world data that can be used in further performance studies such as analytical or simulation models, since many aspects of the dynamics of systems cannot easily be determined purely from the specifications. This becomes more relevant the more complex the studied system is.

Measurements are classically characterized in the following categories [36]:

- Active measurements versus passive monitoring
- Event driven measurements versus sampling
- Summary versus event oriented performance evaluation
- Software, hardware and hybrid monitoring

During active measurements, the object system is observed while synthetic load is generated. This allows a well defined workload to be applied to the system and minimizes the effects of uncontrolled activities. Passive monitoring is applied to evaluate the system under real-world conditions, where the workload is generated by actual user interaction without influencing its behavior by applying synthetic load.

Sampling is the process of observing the system at regular time intervals and recording performance data like statistics of resource utilization for this interval. For example, the CPU load of a system is often measured by sampling, i.e. by recording the fraction of time the CPU was busy during a certain period.

Event driven measurements are usually used to obtain fine-grained performance data. During this process, timestamps are recorded for relevant points. These points might for example mark the beginning and the end of a calculation. These points are called events and are timeless, whereas the periods of time marked by events for their start and end are called activities. Their duration can be calculated as the difference of the timestamps.

This leads directly to the difference between summary and event oriented performance evaluation: When statistical measures like mean values or quantiles are collected during the measurement, we speak of a summary performance evaluation. This is most common for sampling. The results of an event driven measurement allow for event oriented performance evaluation, where important aspects are recorded with timestamps. This makes it possible to calculate various performance data after the measurement process and to use the recorded timestamps for detailed input modeling in a performance model of the system. Statistics like the probability density function of the duration of system activities can be calculated in this way.

Event driven performance evaluation usually consists of three steps: event recognition, where the measurement system is triggered to generate a timestamp; the generation of the timestamp itself; and the recording of the event record, which usually consists of the timestamp and an event identifier that distinguishes between the different events. The sequence of event records is referred to as the event trace.

There are three basic ways to perform event driven measurements: hardware monitoring, software monitoring and hybrid monitoring. These approaches differ in the method used to conduct the three steps mentioned above. In hardware monitoring, all three steps are done in hardware. That means that a dedicated piece of hardware is needed to recognize an event.
In relatively simple systems like small electronic controller units, this step can be as easy as snooping the address bus of the microcontroller and reacting to a certain activity, like writing to or reading from a certain address, to trigger the recording of an event record. The generation of an event identifier involves obtaining the relevant information from the system using another hardware component. When a write action to a special address marks the beginning of a relevant action, a simple example of a unique event identifier might be this special address, which could be obtained from an additional bus interface that snoops the memory bus of the processor. The resulting event records are recorded by a dedicated hardware event recorder. Figure 4.1 illustrates the hardware monitoring setup of a small embedded system where the processor bus to the main memory is monitored to trigger the event recording and to obtain the event identifiers using a bus interface.

The main advantage is that the performance of the system to be measured, the object system, is not influenced by the monitoring process, since all additional activities required for the monitoring are done in additional hardware. Furthermore, the precision and resolution of the timestamps generated do not depend on the system clock and can thus be influenced by the hardware used to conduct the measurements.

But most complex architectures like modern servers or desktop computers have certain characteristics that make this method impractical. For example, all CPUs used in this context use memory management units (MMUs) that introduce a layer of abstraction between memory access in the program code and the physical memory. Therefore, accesses to specific memory locations are not easily seen on the address bus. Additionally, multi-level caches in these architectures prevent the CPU from accessing external memory at all in some cases. Activities on higher levels are thus not easy to recognize using hardware monitoring. For the reasons mentioned, it is also hard to determine proper event identifiers.

Figure 4.1: Hardware Monitoring

In contrast, software monitoring shifts all three steps to software running on the object system itself. While it is easy to trigger the logging of event entries at relevant points in the flow of execution of the program and to generate meaningful event identifiers, it might prove complicated to generate exact timestamps. Under certain circumstances, other components of the object system such as operating system processes cause internal delays, especially when the event is triggered asynchronously by external hardware, for example by packets arriving at the network interface. Furthermore, the generation of performance data, in this case called instrumentation, can affect the performance of the system considerably. The accuracy and resolution of the timestamps depend on the properties of the clock source used. Since it has to be a clock in the object system, it is challenging to improve this step. Figure 4.2 shows an example of a software monitoring solution, where the IP stack of the object system has been instrumented to generate events that are timestamped with the operating system clock and recorded in a buffer in kernel space that can be read by a user mode process.

Figure 4.2: Software Monitoring

To reduce the complexity of the event detection and trigger generation in pure hardware monitoring, a combined approach with methods from software monitoring sometimes proves feasible. This combination is often referred to as hybrid monitoring. In this case, all or some events are detected in software. The event recording and timestamping are usually done in hardware, so the software instrumentation on the object system has to provide trigger signals for the event recorder. An example of a hybrid monitoring system with event recognition in software, where the event identifiers and timestamps are created in hardware using a bus interface, is shown in figure 4.3. The event identifier can be determined by a piece of dedicated code or hardware. While this method allows for high precision timestamps and a relatively easy instrumentation for activities on higher levels, dedicated hardware is needed nonetheless.
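The notions of event record, event trace and activity used in this chapter can be illustrated with a small sketch. The class name `EventRecord`, the function `activity_durations` and the event identifiers below are hypothetical and chosen only for illustration; they are not taken from the instrumentation described in this work.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EventRecord:
    timestamp_ns: int  # time at which the event was recognized
    event_id: str      # identifier that distinguishes the event types


def activity_durations(trace: List[EventRecord],
                       start_id: str, end_id: str) -> List[int]:
    """Pair each start event with the next end event in the trace and
    return the durations of the resulting activities in nanoseconds."""
    durations = []
    start_time = None
    for record in trace:
        if record.event_id == start_id:
            start_time = record.timestamp_ns
        elif record.event_id == end_id and start_time is not None:
            durations.append(record.timestamp_ns - start_time)
            start_time = None
    return durations


# A tiny synthetic event trace (values are made up for the example).
trace = [
    EventRecord(1_000, "calc_begin"),
    EventRecord(1_450, "calc_end"),
    EventRecord(2_000, "calc_begin"),
    EventRecord(2_300, "calc_end"),
]
durations = activity_durations(trace, "calc_begin", "calc_end")
print(durations)
```

Because the complete trace is kept, statistics such as the distribution of the activity durations can be computed after the measurement, which is the essence of event oriented evaluation.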

Figure 4.3: Hybrid Monitoring

Most modern computer architectures include some form of communication. Communication is not only important for large servers on the Internet; even small embedded devices are often equipped with network interfaces today. Examples of such systems are electronic control units (ECUs) in automotive applications, where more than 70 devices exchange messages over a number of different bus systems in a current upper class car, and wireless sensor nodes, small devices that include a low power central processing unit, a number of sensors to record environmental data and some form of radio communication. Therefore, communication plays an important role in performance evaluations of computer systems. For a thorough study, it is not enough to assess the performance of one system in isolation; the interaction with other systems has to be taken into account. When conducting measurement studies of distributed systems, these aspects need to be handled.

One important consequence is that if the communication itself is viewed as an important activity, a global time base for all components of the object system is indispensable, since the event that marks the beginning of a communication activity and the event that marks its end are generated on different components. In a pure software monitoring process where all timestamps are created using the different clocks of the components of the object system, these clocks have to be related to each other to determine inter-component communication delays.

4.1 Computer Clocks

Using the clocks of the object system to generate timestamps for the events poses numerous problems. Traditional Unix clocks use an internal structure to represent the time in counters that are incremented in jiffies. A jiffy is generated by the clock interrupt. The programmable interrupt controller is instructed to generate an interrupt every 1/Hz seconds [78]. For Linux kernel versions up to 2.4, the standard value of Hz was 100. Although it was possible to change this number in Linux, this was hardly ever done, since it also influenced other system aspects like the granularity of the process scheduler.

A number of ways have been proposed to interpolate between successive jiffies. The time stamp counter (TSC) of modern CPUs was used in Linux 2.4 for this purpose when it was available. The TSC is a 64-bit cycle counter inside the CPU that is incremented at the internal CPU clock frequency and can be read like a normal CPU register using special opcodes. The kernel calls to read the wall clock time cause a context switch in most operating systems, whereas the TSC can be read both in kernel and user mode without a context switch. The time to read the clock is shown in figure 4.4. This figure was generated using a small program that reads the clock 1,000 times and calculates the differences of successive timestamps. The mean time to read the clock using the gettimeofday() call is around 4 microseconds, whereas the mean time between successive read instructions for the TSC is only about 40 nanoseconds.

Timekeeping in the Linux kernel has undergone several changes in the 2.6 versions. The first one was the increase of the value Hz from 100 to 1000. This change improved the granularity of both the timer ticks and the scheduler to below one millisecond. Newer versions of the 2.6 kernel series include a number of changes in the handling of timers and a flexible handling of clock event sources implemented in the Generic Time-of-Day subsystem [80]. This system enables the kernel to use different hardware elements like the local APIC to generate clock events. A clock event in this context is similar to the traditional ticks, but with the new subsystem, the scheduling of operating system tasks is decoupled from the generation of timer interrupts by the clock event source. Newer modifications also changed the handling of kernel timers to a large extent [25].
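The measurement behind figure 4.4 (reading a clock 1,000 times and taking the differences of successive timestamps) can be approximated with a few lines of code. The following is only a sketch under stated assumptions: it uses Python's `time.perf_counter_ns()` as a stand-in clock rather than gettimeofday() or the TSC, the helper name `successive_differences` is hypothetical, and the interpreter overhead dominates the result, so the numbers are not comparable to those reported above.

```python
import time


def successive_differences(read_clock, n=1000):
    """Read the clock n times back to back and return the n - 1
    differences between successive readings."""
    stamps = [read_clock() for _ in range(n)]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# Assumption: perf_counter_ns() stands in for the clock under test.
diffs = successive_differences(time.perf_counter_ns)
mean_ns = sum(diffs) / len(diffs)
print(f"mean time between successive reads: {mean_ns:.0f} ns")
```

Plotting `diffs` over the sample index gives a trace plot of the same kind as figure 4.4; individual large values typically indicate interruptions of the measuring process.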

There are several approaches to improve timekeeping in the Linux kernel. One of the most sophisticated is the PPS API patch [93] for Linux 2.4 developed by Ulrich Windl. This kernel extension is based on the nanokernel [57] by David Mills as implemented in current FreeBSD systems. It uses the TSC to interpolate between timer ticks to provide nanosecond time resolution. Besides refining the granularity of the clock, it also implements the PPS API [62], an interface for generating timestamps for PPS pulses (cf. section 4.3.3).

Figure 4.4: Latencies for Reading the Time (panels: gettimeofday() and rdtscll(); latency [ns] over sample index)

4.2 Clock Errors

In common computer architectures and operating systems, most clock sources are driven by a central quartz oscillator. This oscillator is often also used as the frequency source for the CPU. For that purpose, a clock multiplier and divider is used to generate higher frequencies from the frequency of the quartz. Due to the manufacturing process, all quartz oscillators have a systematic frequency error. That means that the frequency of the oscillator is higher or lower than the nominal frequency specified. The frequency error is on the order of 100 ppm for the oscillators of common PC hardware. Using more sophisticated manufacturing methods, far more precise quartz oscillators can be made. This involves using a different cut of the quartz crystal and a method to fine-tune the frequency by small amounts using a mechanical or electrical device.

The frequency error described above is a systematic error. That means that the frequency difference can be determined by measuring over longer periods of time and the resulting errors can be compensated. This can be done by calculating new time readings from the raw clock readings after the measurements or, before the measurement is done, by changing the amount of time by which the internal clock of the operating system is increased at every timer interrupt.

Besides this systematic frequency error, all quartz oscillators have a temperature dependent error component. This component is some orders of magnitude smaller than the systematic error. Nonetheless, it contributes to the error of the clock readings and accumulates over time in time measurements. This effect becomes more important the longer the measurement takes. Since the frequencies of the oscillators change over time, this effect cannot easily be eliminated. The manufacturers of high precision oscillators offer devices that use temperature compensation inside the oscillators. This compensation can be achieved using analog circuits in temperature compensated crystal oscillators (TCXOs) or using digital logic in digitally temperature compensated crystal oscillators (DTCXOs). Another method to eliminate the effect is the use of a small oven that heats the crystal to a constant temperature above room temperature. These components are called oven controlled crystal oscillators (OCXOs) and offer the highest precision of all available quartz oscillators.

Figure 4.5 illustrates the effect a change in temperature has on the CPU frequency of a system. In this experiment, we observed the frequency of the CPU by reading the TSC value of a 900 MHz Athlon CPU (measured mean frequency f_n = MHz) at regular intervals of τ_0 = 1 s, using an interrupt triggered by a signal sent with a frequency of precisely 1 Hz by our GPS hardware. The plotted values are filtered using an averaging algorithm over τ = 32 s to eliminate TSC reading errors caused by the interrupt latency of the system. Since the quartz oscillator had no temperature sensor attached to it, we used the sensor of the southbridge chipset to determine the general temperature tendency of the computer. The oscillation of the frequency with a period of about 40 minutes was found to be caused by temperature changes due to the duty cycle of the air conditioning system.

Figure 4.6 shows the frequency variation caused by the temperature change over a period of 95 hours. Besides the cycle mentioned above, the graph shows a lower frequency at times t ≈ 31.2 h, t ≈ 55.2 h and t ≈ 79.2 h. Since the measurements were done in June, the outside temperature increased to a value the air conditioner was unable to compensate for during that time of day on the three days.
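The frequency estimation used for figures 4.5 and 4.6 can be sketched as follows. This is an illustration under assumptions, not the evaluation code used for the figures: the averaging is implemented as a simple block average over non-overlapping windows (the exact averaging algorithm is not specified here), the function names are hypothetical, and the TSC readings are synthetic values for a CPU that runs 1 ppm fast relative to an assumed nominal 900 MHz.

```python
def mean_frequencies(tsc_readings, tau0=1.0, window=32):
    """Estimate the CPU frequency in Hz from TSC values captured at
    successive PPS pulses (spacing tau0 seconds), averaged over
    non-overlapping blocks of `window` pulse intervals."""
    freqs = []
    for k in range(0, len(tsc_readings) - window, window):
        cycles = tsc_readings[k + window] - tsc_readings[k]
        freqs.append(cycles / (window * tau0))
    return freqs


def frequency_error_ppm(freq, nominal):
    """Relative deviation of freq from the nominal frequency in ppm."""
    return (freq - nominal) / nominal * 1e6


# Synthetic readings: 900 MHz nominal, oscillator 1 ppm too fast.
nominal = 900_000_000
readings = [i * 900_000_900 for i in range(65)]
freqs = mean_frequencies(readings)
errors = [frequency_error_ppm(f, nominal) for f in freqs]
print(freqs, errors)
```

Averaging over 32 pulse intervals divides the effect of a single interrupt latency on the cycle count by 32, which is why the longer interval suppresses TSC reading errors.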

Due to the directional exposure of the server room, the peaks in temperature and thus in the measured frequency occur at intervals of 24 hours. The frequencies occurring at the three different temperature levels reported by the temperature sensor are shown as histograms in figure 4.7. The effect of the temperature can be determined and used for a frequency correction. But as the figure shows, a higher temperature resolution is needed to mitigate the influence.

Figure 4.5: Frequency Changes with Temperature (frequency [MHz] and temperature [°C] over time [h])

A more exhaustive evaluation of these effects and the influences of temperature and power management has been done by Stefan Schreieck in [74]. The results support the assumption that more precise timekeeping could be achieved in modern operating systems by adjusting the amount of time added to the current value of the system clock according to the actual temperature of the main oscillator.

Since software monitoring is based on generating timestamps, it is not the frequency error itself that matters, but the phase error, the offset of the clock with respect to a reference clock. In case of measurements of one-way delays in the absence of a reference clock, the difference of the current values of the system clocks is added to each value obtained, since the timestamp for the sending event is generated in the source system and the timestamp for the receive event by the sink. Even when both systems have zero clock offset at the beginning of a measurement, the temperature dependent frequency errors cause a phase error during the measurement. A remaining frequency difference of merely one ppm leads to a phase difference of one microsecond after one second.

Figure 4.6: Frequency Variation (frequency error [ppm] and temperature [°C] over time [h])

Figure 4.8 illustrates this effect, where the offsets of the clocks of two PCs compared to a GPS-based reference clock are plotted over time. Both PCs were located in close proximity in an air-conditioned room. The systematic frequency error had already been eliminated before the measurement was started. One would expect that the phase errors of both systems evolve in a similar way, but slight differences in the internal temperature and in the cutting of the quartz crystal during manufacturing cause both systems to behave differently. Since the delays measured in our local area network are around 60 microseconds, this difference influences the measurement considerably.

To evaluate the effect, we transmitted UDP packets between two computers, PC1 and PC2. Timestamps were generated both when a packet was sent and when it was received, using the clock of the respective system as usual in software monitoring.

Figure 4.7: Frequency Distribution (histograms of the frequency [MHz] at temperatures of 24.6 °C, 25.2 °C and 25.3 °C)

Assume a packet is sent from PC2 to PC1. Let $t_{2,s2}(i)$ be the timestamp generated by PC2 when sending the $i$-th packet and $t_{1,r1}(i)$ the timestamp generated by PC1 when receiving this packet. The measured one-way delay, marked with a tilde, is calculated as

$$\tilde{d}_{2,1}(i) = t_{1,r1}(i) - t_{2,s2}(i).$$

Assume PC1 has a constant offset $o$ (phase error difference) compared to PC2, i.e. the time $t_1(t)$ on PC1 and the time $t_2(t)$ on PC2 at real time $t$ differ by a constant amount $o$ for all values of $t$:

$$t_1(t) - t_2(t) = o(t) = o \quad \forall t.$$

When $t_{1,s2}(i)$ denotes the time of the clock of PC1 when the packet was sent at PC2, the correct one-way delay $d_{2,1}(i)$ can be calculated as

$$d_{2,1}(i) = t_{1,r1}(i) - t_{1,s2}(i).$$

Since we know that $t_1(t) = t_2(t) + o$, we can determine

$$d_{2,1}(i) = t_{1,r1}(i) - \left(t_{2,s2}(i) + o\right) = \tilde{d}_{2,1}(i) - o.$$

Figure 4.8: Phase Errors (offsets [ms] of PC1 and PC2 and their difference over time [h])

Similarly, we can analyze packets sent from PC1 to PC2:

$$\tilde{d}_{1,2}(i) = t_{2,r2}(i) - t_{1,s1}(i)$$

$$d_{1,2}(i) = t_{2,r2}(i) - t_{2,s1}(i) = t_{2,r2}(i) - \left(t_{1,s1}(i) - o\right) = \tilde{d}_{1,2}(i) + o.$$

Thus,

$$\tilde{d}_{2,1}(i) = o + d_{2,1}(i)$$

and

$$-\tilde{d}_{1,2}(i) = o - d_{1,2}(i).$$

These two quantities, together with the offset of the clock of PC1 from the clock of PC2, are shown in figure 4.9. The main reason for the phase differences was the different reaction of the quartz oscillators to the change of temperature. The variable part of the frequency error itself can be neglected in the calculation of the one-way delays, as it is below 1 ppm and thus contributes less than 1 ppm to the measurement error of each delay calculation.
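The relations between measured and correct one-way delays derived above can be checked numerically. The sketch below is illustrative only: the function names and all numbers are made up, and the second helper estimates the offset under the additional assumption of a symmetric path (equal correct delays in both directions), which is not part of the derivation above.

```python
def correct_delays(measured_2_to_1, measured_1_to_2, offset):
    """Remove a known constant clock offset o = t1(t) - t2(t) from
    measured one-way delays in both directions.

    PC2 -> PC1: correct = measured - o
    PC1 -> PC2: correct = measured + o
    """
    d21 = [d - offset for d in measured_2_to_1]
    d12 = [d + offset for d in measured_1_to_2]
    return d21, d12


def estimate_offset_symmetric(measured_2_to_1, measured_1_to_2):
    """Assumption (not from the text): if the correct delay is the same
    in both directions, half the difference of the measured delays
    yields the offset o."""
    return (measured_2_to_1 - measured_1_to_2) / 2


# Made-up example in microseconds: correct delay 60, offset o = 250.
measured_21 = [60 + 250]   # measured delay PC2 -> PC1 includes +o
measured_12 = [60 - 250]   # measured delay PC1 -> PC2 includes -o
d21, d12 = correct_delays(measured_21, measured_12, offset=250)
o_hat = estimate_offset_symmetric(measured_21[0], measured_12[0])
print(d21, d12, o_hat)
```

The example also shows why an uncorrected measurement can be useless: with an offset larger than the delay itself, the measured delay in one direction becomes negative.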

Figure 4.9: UDP Delays

But frequency errors are not the only cause of phase errors. Delays when reading the clock appear as phase errors, too. This can be the case both when reading the clock for timestamping and during the time synchronization process [59].

More sophisticated approaches to characterize the error involved in time and frequency measurements are presented in the technical note [81], which contains a number of articles on this topic. The most important ones in our context are [47], [33], [1] and [17]. The papers cited are based on the assumption that two clocks with their oscillators are compared. All analyses use a data set of the fractional frequency or time fluctuations between these clocks.

The first distinction that has to be made is the one between non-random and random fluctuations. Non-random fluctuations can be easily determined and predicted. Suppose one oscillator has a constant frequency difference to the other. Then the time difference between the two clocks increases linearly. As the frequency difference can be estimated as the mean value of the frequency measurements, the phase errors (time differences) caused by this type of fluctuation can be predicted. Thus, these effects are called systematic. Another systematic fluctuation would be a linear frequency drift (linear change of the frequency), which leads to quadratically growing phase fluctuations.

After determining, predicting and eliminating the systematic fluctuations, a set of errors remains in the data. This set of errors contains the random errors and has to be characterized using statistical methods either in the Fourier frequency domain or in the phase domain.

When one oscillator is compared to a reference as in the setting described above, $y(t)$ denotes the instantaneous normalized frequency deviation from the nominal frequency $\nu_0$ at time $t$ and $\phi(t)$ the phase deviation in radians from the nominal phase $2\pi\nu_0 t$. They are related by

$$y(t) = \frac{1}{2\pi\nu_0}\,\frac{d\phi(t)}{dt} = \frac{\dot{\phi}(t)}{2\pi\nu_0}.$$

Another important measure is the phase deviation $x(t)$ expressed in units of time:

$$x(t) = \frac{\phi(t)}{2\pi\nu_0}.$$

One main observation when dealing with clock readings is that the noise processes that cause errors are often not Gaussian, and the processes are not stationary. This is why traditional measures like the mean or standard deviation do not provide valid predictions.

Classification in the Frequency Domain

The frequency and phase error processes can be classified in the frequency domain using their one-sided spectral densities, i.e. spectral densities where the Fourier frequencies range over the interval from 0 to $\infty$. These spectral densities can be determined for all quantities defined above. $S_y(f)$ denotes the one-sided spectral density of $y(t)$, $S_\phi(f)$ that of $\phi(t)$, $S_{\dot{\phi}}(f)$ that of $\dot{\phi}(t)$ and $S_x(f)$ that of $x(t)$.

47 4.2 Clock Errors The relation between the different spectral densities can be expressed by the following equations: S y ( f ) = f 2 ν 2 S φ( f ) 0 S φ ( f ) = (2π f ) 2 S φ ( f ) 1 S x ( f ) = (2πν 0 ) 2 S φ( f ). A common way to characterize the instabilities is to plot the spectral densities over the Fourier frequency. The most important fluctuations are often represented as a sum of five different noise processes using power-law spectral densities for S y ( f ): 2 α= 2 h α f α for 0 < f < f h S y ( f ) = 0 for f f h. In this equation, h α is a scale factor, α an integer between 2 and 2 and f h is the cut-off frequency of a low-pass filter. The five noise processes can be identified in a logarithmic plot of S y ( f ) over the logarithm of the Fourier frequency f as depicted in figure Figure 4.10: Power-Law Spectral Densities In this log log plot, α appears as the slope of the line that relates S y ( f ) to f, whereas h α is the amplitude of the corresponding noise process. The area in the plot where the slope α is 2, the noise process is white phase (PM) noise, which is often induced by the measurement process. When this noise process 47

is part of the signal, it is mainly caused by the devices used for amplification of the signal. The same reason is also the cause of flicker phase (PM) noise, which appears in the area of the graph where $\alpha = 1$. Another reason for the presence of this noise component is the use of frequency multipliers, which are often used on PC mainboards and CPUs to generate a higher frequency signal from a lower frequency quartz oscillator output. White frequency (FM) noise with $\alpha = 0$ often appears when a slave quartz oscillator is locked to the frequency output of another device. This is the case for cesium and rubidium oscillators as well as in GPS receivers where a quartz oscillator is disciplined by atomic clocks present in the satellites of the GPS constellation. Flicker frequency (FM) noise may be caused by physical resonance mechanisms of active oscillators and by environmental properties. It is identifiable as the area with slope $\alpha$ equal to $-1$ in the log-log plot. The fifth noise component, random walk frequency (FM) noise or white frequency aging, is visible as an area with slope $-2$. It is related to the physical environment of the oscillator and can be caused by mechanical shock, vibration and changing temperature. All these effects result in a change of the frequency.

Classification in the Time Domain

As the data observed when characterizing clock errors do not belong to stationary processes, classic measures like the mean and the standard deviation do not provide meaningful results. The standard deviation of clock error measurements will often increase with the number of samples included in the calculation. Therefore, these measures cannot be used to compare the performance of different clocks. A measure commonly used for the classification of clock errors in the time domain is the two-sample Allan variance [2].
An intuitive introduction of the Allan variance is presented in [47]: From the time differences (phase errors) of two clocks at times $t_1$ and $t_2 = t_1 + \tau$, denoted by $x(t_1)$ and $x(t_2)$, the frequency difference during this interval can be estimated as

$$\bar{y}_1 = \frac{x(t_2) - x(t_1)}{\tau}.$$

The time difference at time $t_3 = t_2 + \tau$ can be estimated as

$$\hat{x}(t_3) = x(t_2) + \bar{y}_1 \tau = 2x(t_2) - x(t_1).$$

This estimation is based on the assumption that the frequency in the interval from $t_2$ to $t_3$ is the same as in the previous one from $t_1$ to $t_2$. The prediction error $\epsilon = x(t_3) - \hat{x}(t_3)$ is proportional to the difference of the frequency errors $\bar{y}_2 - \bar{y}_1$, assuming that $\bar{y}_2$ is the frequency error in the interval $t_2$ to $t_3$. Expressed using time measurements, the prediction error is proportional to

$$\frac{x(t_3) - 2x(t_2) + x(t_1)}{\tau}.$$

One half of the mean-square value of this quantity is called the two-sample Allan variance for an averaging time of $\tau$, denoted by $\sigma_y^2(\tau)$. Therefore,

$$\sigma_y^2(\tau) = \frac{\left\langle (\bar{y}_{k+1} - \bar{y}_k)^2 \right\rangle}{2},$$

where the angle brackets $\langle\,\rangle$ denote an infinite time average over the adjacent samples $t_{k+1} = t_k + \tau$, which are thus time difference measurements with a fixed sample rate $1/\tau$. This results in frequency estimates $\bar{y}_k$ with zero dead time, where

$$\bar{y}_k = \frac{1}{\tau} \int_{t_k}^{t_{k+1}} y(t)\,dt = \frac{x(t_{k+1}) - x(t_k)}{\tau}.$$

From the equations above, it can be seen that a constant frequency offset does not influence the Allan variance; the measure therefore does not cover frequency accuracy. The square root of the Allan variance is called the Allan deviation $\sigma_y(\tau)$. A more efficient method for calculating the Allan variance [33] can be obtained for measurements with a constant rate $1/\tau_0$ using overlapping estimates as

$$\sigma_y^2(\tau) = \frac{1}{2(N - 2m)\tau^2} \sum_{i=1}^{N-2m} \left( x(t_{i+2m}) - 2x(t_{i+m}) + x(t_i) \right)^2,$$

where $N$ is the original number of time difference measurements spaced by $\tau_0$, $M = N - 1$ the number of frequency error measurements of sample time $\tau_0$, and $\tau = m\tau_0$.
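The overlapping estimator can be illustrated with a short Python sketch. It is only a direct transcription of the sum above, not the implementation used for the experiments; the function names `overlapping_allan_variance` and `allan_deviation` are chosen here for illustration, and the phase samples are assumed to be given as a plain list of time differences spaced by `tau0`.

```python
import math


def overlapping_allan_variance(x, tau0, m):
    """Overlapping Allan variance for tau = m * tau0.

    x    -- time difference (phase error) samples x(t_i), spaced by tau0
    tau0 -- sample spacing in seconds
    m    -- averaging factor (positive integer)
    """
    n = len(x)
    terms = n - 2 * m  # number of second differences, N - 2m
    if m < 1 or terms < 1:
        raise ValueError("need m >= 1 and at least 2*m + 1 samples")
    tau = m * tau0
    # Sum of squared second differences of the phase samples.
    total = sum((x[i + 2 * m] - 2 * x[i + m] + x[i]) ** 2 for i in range(terms))
    return total / (2.0 * terms * tau ** 2)


def allan_deviation(x, tau0, m):
    """Allan deviation, the square root of the Allan variance."""
    return math.sqrt(overlapping_allan_variance(x, tau0, m))
```

A phase record that grows linearly in time, which corresponds to a constant frequency offset, yields a variance of zero in this sketch, in line with the observation that the measure does not cover frequency accuracy.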

Figure 4.11: Allan Deviation

As is the case for power-law spectral densities, the different error process components can be seen as different slope characteristics in a plot of the logarithm of the Allan variance $\sigma_y^2(\tau)$ or of the Allan deviation $\sigma_y(\tau)$ over the logarithm of the averaging time $\tau$. Figure 4.11 shows a typical plot of the Allan deviation for the five independent noise processes. Table 4.1 summarizes the error processes and slopes in different plots. This table also shows that the Allan variance does not show different slopes for white and flicker phase noise processes. For that reason, another two-sample variance has been developed, the modified Allan variance. It is defined as

$$\operatorname{Mod}\,\sigma_y^2(\tau) = \frac{1}{2\tau^2} \left\langle \left[ \frac{1}{n} \sum_{i=1}^{n} \left( x_{i+2n} - 2x_{i+n} + x_i \right) \right]^2 \right\rangle$$

and makes it possible to distinguish between white and flicker phase noise processes in a log-log plot of $\operatorname{Mod}\,\sigma_y^2(\tau)$ versus $\tau$ as areas with slopes $-3$ and $-2$, respectively. Like the normal Allan variance, the modified Allan variance can also be determined using overlapping estimates. For $N$ time measurements spaced by $\tau_0$, the modified Allan variance can be calculated for a chosen $\tau = m\tau_0$ as

$$\operatorname{Mod}\,\sigma_y^2(\tau) = \frac{1}{2\tau^2 m^2 (N - 3m + 1)} \sum_{j=1}^{N-3m+1} \left[ \sum_{i=j}^{j+m-1} \left( x(t_{i+2m}) - 2x(t_{i+m}) + x(t_i) \right) \right]^2.$$

Since the time dispersion is the primary concern in our field of application, the time variance (TVAR) [79] is especially useful. It is an estimator for the timing

errors caused by frequency variations. It is designated $\sigma_x^2(\tau)$ and can be calculated using the modified Allan variance as

$$\sigma_x^2(\tau) = \frac{\tau^2}{3} \operatorname{Mod}\,\sigma_y^2(\tau).$$

Table 4.1: Slope Characteristics (slopes in the respective log-log plots)

Noise Process              Frequency Domain      Time Domain
                           S_y(f)    S_φ(f)      σ_y²(τ)   σ_y(τ)   Mod σ_y²(τ)   σ_x²(τ)
White Phase Noise            2         0           -2        -1        -3           -1
Flicker Phase Noise          1        -1           -2        -1        -2            0
White Frequency Noise        0        -2           -1       -1/2       -1            1
Flicker Frequency Noise     -1        -3            0         0         0            2
White Frequency Aging       -2        -4            1        1/2        1            3

Another advantage of the time variance, besides being interpretable in the time domain, is that it can be used to easily identify the onset of the domain in which the spectrum is dominated by white frequency noise as the point in the log-log plot where the slope changes from zero to one. The time deviation $\sigma_x$ is defined as the square root of the time variance. Fast algorithms for calculating these measures have been published by Bregni in [10]. We implemented these algorithms in R for our experiments.

4.3 Reference Clocks

To evaluate the accuracy of clocks and to synchronize them, a reference standard is needed. Nowadays, the most common reference clocks for computer systems are NTP servers [59] that distribute timing information received from other reference clocks over the network. David Mills, the inventor and maintainer of NTP, has evaluated the performance of his system under real-world conditions and discovered that the synchronization accuracy of NTP over LAN links is in the order of 10 µs with spikes up to 100 µs due to varying network delays, caused mainly by the queuing delays in network elements like switches and network adapters. Over WAN links, the accuracy is even more impaired by additional components such as routers in the transmission path of the datagrams used for time transfer.

Therefore, one can expect an accuracy of about 5 ms in the Internet, but errors up to 100 ms have also been observed [58]. These aspects lead to the conclusion that the achieved precision is not sufficient to estimate accurate distribution functions for measured one-way delays, even when an NTP server is available in the local area network.

NTP

Despite the limitations mentioned, it is worthwhile looking at the functionality of the NTP protocol and the internal mechanisms of NTP servers, since the system is both used as one component in our solution and provides ideas for implementing our own time synchronization solutions. This description of NTP follows [58], as this monograph by David Mills is the most comprehensive and detailed work on this topic.

Figure 4.12: NTP Time Transfer

As the internal oscillators of computers were not chosen with precision timekeeping in mind, undisciplined clocks of different computer systems tend to differ both in phase and frequency. The idea of NTP is to use a network connection between two systems to transfer timing information. The same NTP software is used on both sides of the connection, so there is no need for different software on the client and the server. A system that acts as a client to some server can also act as a server to other systems. This creates hierarchies of NTP servers. The roots of these hierarchies are referred to as stratum 1 servers and are usually connected to some reference time source other than NTP. When descending the hierarchy, the stratum number increases, i.e. the next level servers have stratum 2. For

determining the clock offset, the protocol specifies a protocol data unit (PDU) that, among other information, can hold three timestamps referred to as the origin, receive and transmit timestamp. A 64 bit format with a resolution of 232 ps is used for all packet timestamps. For the time transfer as depicted in figure 4.12, a client generates an NTP PDU, fills the origin timestamp with the current local time $T_1$ and sends the PDU to the server using the connectionless protocol UDP. Upon receiving the datagram, the server fills the receive timestamp field with its current clock reading $T_2$. Just before sending the datagram back to the client, the transmit timestamp is filled with $T_3$. When the client receives the packet, it immediately generates a fourth internal timestamp $T_4$ that enables it to calculate both the time offset

$$\theta = \frac{1}{2} \left[ (T_2 - T_1) + (T_3 - T_4) \right]$$

and the round-trip delay

$$\delta = (T_4 - T_1) - (T_3 - T_2)$$

of the datagram. Figure 4.12 shows the transfer of timestamps used by NTP. While it is obvious that the calculation of the round-trip time is always correct, the calculation of the offset assumes that the delays in the network are symmetric, i.e. $T_4 - T_3 = T_2 - T_1$. This time transfer is repeated a number of times and the measurement with the smallest round-trip delay is selected to be used in further calculations, as it is assumed that the lowest delay is an indication for the least queuing and thus the most symmetry. Since a client usually has associations to more than one NTP server, these measurements are then processed using filters and selection, clustering and combining algorithms to generate a guess of the local clock offset from all the values of the different servers. This selected value can then be used to discipline the clock. Successive measurements make it possible to determine the frequency of the local clock over an averaging interval $\tau$.
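The offset and delay calculation from the four timestamps can be sketched in a few lines of Python. This is only an illustration of the two formulas above and of the minimum-delay selection, not code taken from an NTP implementation; the names `ntp_offset_and_delay` and `select_min_delay_sample` are assumptions made for this example, and timestamps are assumed to be plain floating-point seconds.

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """Return (theta, delta) for one NTP exchange.

    t1 -- origin timestamp (client clock, request sent)
    t2 -- receive timestamp (server clock, request received)
    t3 -- transmit timestamp (server clock, reply sent)
    t4 -- client clock reading when the reply arrives
    """
    theta = 0.5 * ((t2 - t1) + (t3 - t4))  # offset, assumes symmetric delays
    delta = (t4 - t1) - (t3 - t2)          # round-trip delay
    return theta, delta


def select_min_delay_sample(exchanges):
    """Pick the (theta, delta) pair with the smallest round-trip delay.

    exchanges -- iterable of (t1, t2, t3, t4) tuples
    """
    samples = [ntp_offset_and_delay(*e) for e in exchanges]
    return min(samples, key=lambda s: s[1])
```

As a usage example, a client whose clock is 0.5 s behind the server and whose one-way delay is 0.1 s in each direction would record `(100.0, 100.6, 100.7, 100.3)`, for which the sketch returns an offset of about 0.5 s and a round-trip delay of about 0.2 s.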
The clock is disciplined by changing the amount of time that is added to the local clock in each clock update cycle. That means that the clock operates as a variable frequency oscillator (VFO). The amount of change is determined by a feedback loop. Depending on the averaging interval over which the current frequency is determined, a phase-locked loop (PLL) or a frequency-locked loop (FLL) is used to control the frequency adjustment. An overview of the architecture of an NTP server is shown in figure 4.13. The same mechanisms can be used for controlling the clock of stratum 1 servers, but the timing reference is a clock source directly connected to the system without

any network in between.

Figure 4.13: NTP Architecture

Time Sources

When looking at the reference clocks for stratum 1 NTP servers, two main options exist in Germany: DCF77 and GPS receivers. The DCF77 signal is a coded time signal transmitted using a 77.5 kHz radio frequency. The distributed time is the official German time reference generated by the Physikalisch-Technische Bundesanstalt (PTB). The sender is located in Mainflingen near Frankfurt am Main. While the time sources for the signal are highly precise atomic clocks (cesium and rubidium frequency references) that provide an accuracy of 300 ns for the start of each second, the phase and frequency errors caused during the propagation of the long-wave signal are several orders of magnitude higher [69]. A higher precision can be achieved using GPS technology. The Global Positioning System (GPS) [5] uses a number of orbiting satellites (up to 31) equipped with precise atomic clocks that transmit time information. All satellites are synchronized to a common time base. Several ground control stations monitor the satellite clocks with respect to the time base and send correction information to the respective satellite in case of a discrepancy. For positioning purposes, the reception of the time information of four satellites makes it possible to calculate the current position of the receiver,

since the positions of the satellites and the propagation delays of the signals are known. If the GPS receiver were equipped with a precise time base, the reception of three satellite signals would be sufficient to calculate its position, but since this is not the case for most commercial receivers, four is the minimum number needed [77]. In many places on earth, more than the four satellites needed for positioning are in view all of the time. The system can be used to obtain highly precise timing information. For a receiver in a known fixed position, only one satellite has to be in view to acquire the timing information.

The PPS API

RFC 2783 [62] specifies an application programming interface (API) for using the PPS pulses of an external reference clock directly connected to a stratum 1 NTP server for time synchronization. Many clock sources are capable of providing a signal where a level change marks the beginning of a second with high precision. This signal is called pulse-per-second (PPS) output. The PPS API provides a facility to timestamp level changes of signals delivered to a system interface with high resolution. Since the serial port of the system was often used to connect external clocks, the data carrier detect (DCD) pin of a serial port is commonly used for the PPS signal input. Since the PPS pulse marks the beginning of the second, the timestamps generated make it possible to determine the offset of the local clock with respect to the reference clock, provided that the offset is less than half a second. Larger offsets cannot be determined using the PPS API alone, since the timestamp itself does not reveal to which second (minute, hour, day, month and year) the level change of the signal belongs. Since the time difference between two successive PPS pulses is exactly one second, the frequency error of the receiving system can also be calculated from these measurements.
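How an offset and a frequency error follow from PPS timestamps can be sketched as below. The sketch is an illustration only and does not use the RFC 2783 interface itself; the function names and the representation of timestamps as integer nanoseconds of the local clock are assumptions made for this example.

```python
NS_PER_S = 1_000_000_000


def pps_offset_ns(timestamp_ns):
    """Offset of the local clock at a PPS pulse, in nanoseconds.

    The pulse marks a second boundary of the reference clock, so the
    sub-second part of the local timestamp is the offset. The result is
    only unambiguous if the true offset is below half a second.
    """
    frac = timestamp_ns % NS_PER_S
    # Fractions of half a second or more are read as "local clock is behind".
    return frac if frac < NS_PER_S // 2 else frac - NS_PER_S


def pps_frequency_errors_ppm(timestamps_ns):
    """Frequency error of the local clock between successive pulses, in ppm.

    Successive pulses are exactly one second apart on the reference clock,
    so any deviation of the local interval from 1 s is a frequency error.
    """
    return [
        (b - a - NS_PER_S) / 1000.0  # ns of error per second -> ppm
        for a, b in zip(timestamps_ns, timestamps_ns[1:])
    ]
```

For instance, a local timestamp of 12.000200000 s gives an offset of +200000 ns (local clock ahead), whereas 11.999900000 s gives -100000 ns (local clock behind).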
The PPS API is available as a patch set for Linux kernels of both the 2.4 and the 2.6 series. The Linux PPS API kernel patch for kernel versions 2.4 [93] modifies the Linux serial port driver for detecting and timestamping signals delivered to the DCD pin of the serial port. In addition to PPS recognition, this patch also extends the timekeeping resolution of the Linux kernel to one nanosecond by utilizing the timestamp counter (TSC). In the 2.6 kernel series, the patch just enables the

timestamping of the pulses, but does not change the general timekeeping properties such as the resolution of the clock. The timestamps of the PPS pulses can be used in two ways to discipline the kernel clock: either by using the hardpps() kernel consumer or by using the user level NTP daemon. Both of them make use of the kernel model for precision timekeeping as specified in RFC 1589 [56] and estimate the frequency error of the local oscillator by averaging the measured time interval between successive PPS pulses. Figure 4.14 shows how an NTP process uses the timestamps provided by the PPS API to discipline the operating system clock that is used for timestamping the PPS signals.

Figure 4.14: NTP and the PPS API

The main advantage of using the PPS API is the low latency of the timestamping process, as the implementation usually instructs the system to invoke a special interrupt handler to generate the timestamp upon reception of the external pulse causing an interrupt. In a careful implementation, the only variable time interval between the hardware logic level change and the generation of the timestamp is thus just the interrupt latency. This interrupt latency changes with other simultaneous tasks of the system and can become relatively high, especially when other tasks involve communication with hardware. On the other hand, reception of an NTP PDU over the network includes buffering both in network components and in the network interfaces of the sending and receiving system. These buffer effects are usually several orders of magnitude worse than the interrupt latency. Additionally, the reception of Ethernet frames often also involves the invocation of interrupt handlers and thus these measurements also include the interrupt latency as an

additional error component.


5 Dedicated Measurement Infrastructure

It was obvious that precise measurements of one-way delays are needed as a base for a fine-grained performance analysis of distributed systems. Due to the structure and complexity of the network hardware used, it became clear that it would not be feasible to implement a hardware monitoring solution for this purpose. The most important problem with such a solution is to recognize the beginning of a transmission and to record all relevant header information of the packets. Since HTTP over TCP/IP is used in our setup, recording the source and destination IP addresses, TCP port numbers and sequence numbers is extremely useful to reconstruct the data exchange after the measurement. Even if it were possible to detect transmissions in hardware, recording these data from layers 2 and 3 would require considerable hardware complexity in the event recorder. Therefore, a hybrid or pure software monitoring approach seemed more promising. We evaluated the possibility of using the hardware monitor ZM4 [18] for a hybrid monitoring approach. The ZM4 system is a distributed hardware monitor that uses a central measurement timing generator (MTG) and a number of distributed dedicated probe units (DPUs). The DPUs are synchronized to the common measurement timing pulses generated by the MTG using a twisted pair connection. The DPUs generate an internal timing signal with a resolution of 100 ns. This local clock of the DPUs is used to timestamp the events which are recognized by the trigger logic and recorded internally in the DPU. The sustained event recording rate is limited to 10,000 events per second. A fully utilized 1000BASE-T Ethernet connection can transmit up to 83,333 packets per second, even if we assume quite large Ethernet frames of 1,500 bytes. Therefore, the hardware we had at hand was not able to handle this event rate.
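The packet rate quoted above follows from a simple division, reproduced here as a small Python sketch. The constant names are chosen for this example, and the calculation ignores preamble and inter-frame gap, as the estimate in the text does.

```python
LINK_RATE_BIT_PER_S = 1_000_000_000  # 1000BASE-T line rate
FRAME_SIZE_BYTES = 1_500             # frame size assumed in the text
ZM4_EVENTS_PER_S = 10_000            # sustained ZM4 event recording rate

# Number of whole frames that fit into one second of line time.
packets_per_second = LINK_RATE_BIT_PER_S // (FRAME_SIZE_BYTES * 8)

# Factor by which the packet rate exceeds the sustained recording rate.
overload_factor = packets_per_second / ZM4_EVENTS_PER_S

print(packets_per_second, overload_factor)
```

Even under this favorable assumption of large frames, the packet rate is more than eight times the sustained recording rate; smaller frames would widen the gap further.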

For that reason, a pure software monitoring approach was the only option. But since the clocks of the object system are used to generate the timestamps and since the nodes of the cluster system are completely independent, a method for synchronizing these clocks is needed. Our experiments have shown that the precision achieved by estimating the clock skew from network delay measurements without a GPS reference clock, as in [66, 63, 67, 31], is not sufficient for determining the distributions of the delays in our system. These techniques were introduced to estimate the phase and frequency difference of clocks used to timestamp packets transmitted over an Internet link, where the resulting transmission delays are several milliseconds long. In an environment with low-latency LAN links, these methods proved ineffective, as the variations of the transfer delays are relatively large compared to the clock differences. Therefore, all these approaches would lead to false predictions. Most of the methods are based on statistical characteristics of the transfer delays, usually a minimum of the estimated one-way delay over a certain period of time, and assume a symmetric link. The use of the minimum is justified, as all effects that affect the transmission delays in an unpredictable way, like queuing delays in routers and switches, lead to longer delays. Thus, when using the minimum, the packets which are least affected are chosen for the estimation of the clock differences. A detail of the plot of UDP transmission delays in figure 4.9 is shown in figure 5.1. This graph makes it clear that the calculated minimum depends heavily on the chosen period of time, and that the minimum of the transfer delays is not distributed symmetrically in both directions.
Even if the general trend of the clock error difference can be estimated using these approaches, the determined distribution of the calculated transfer delays is not exact enough to be used in a sound input model. A detailed evaluation of these methods can be found in the thesis of Johannes Dodenhoff [19]. In earlier stages of our experiments with the web cluster, we equipped each node of the cluster with a dedicated GPS-based time source, a Meinberg GPS167 PCI card. These receivers need to be installed in the respective nodes and share a common roof-mounted antenna. Using these GPS receivers directly as a time source, where every timestamp needed is generated by the clock of the PCI card, proved to be inefficient, as every reading of the clock caused a context switch from user mode to kernel mode and took time in the order of several microseconds. Another disadvantage of this approach is that every node of the object system on which performance measurements have to be conducted must be equipped with its own GPS receiver. Besides causing high

costs, it also limits the applicability to systems with interfaces like PCI for which dedicated GPS hardware is available.

Figure 5.1: Detail of UDP Delays (UDP transmission delay [ms] over time [h]; curves: phase difference between PC1 and PC2, delay PC2 to PC1, negative delay PC1 to PC2)

The solution we found to this problem was to use the standard operating system clocks of the object systems for timestamping the measurement events and to synchronize them to our GPS time source periodically during the measurements. The method is based on standard time synchronization tools such as NTP with GPS receivers and our own modifications. We use a standard NTP server equipped with a GPS receiver to provide coarse timing information to all other nodes of the system over a dedicated synchronization and measurement network. The GPS receiver has a PPS signal output that is documented to deliver a TTL pulse that marks the beginning of a second with an uncertainty below 500 ns with respect to the GPS time. The new idea implemented here is to distribute the PPS signal to all nodes of the cluster and use this timing source in combination with the networked NTP server as a precision timing reference. Since the signal levels of TTL are different from the ones in RS-232, we built a 5V-powered level converter using Maxim MAX3225 chips. These chips were selected because of their relatively low propagation delay. One chip can convert two TTL signals to RS-232 levels, so we used seven chips connected on the TTL side to deliver the PPS signal to all nodes of our cluster plus the NTP server [30]. Figure 5.2 shows the architecture of the whole synchronization system.

Figure 5.2: Synchronization System

As mentioned above, the PPS pulse just marks the beginning of an arbitrary second, but does not contain any information on the absolute time, so all clocks of the cluster nodes must be set to have offsets of less than 500 ms. This is the reason for using a standard NTP server on the network. What makes this solution appealing is that the whole time synchronization during the measurements can be achieved using the ntpdate command before starting PPS synchronization and the hardpps features during the measurements, or by using the NTP daemon with a configuration file that contains two time sources,

the PPS clock driver and the NTP server. The hardpps solution has the advantage that no additional NTP process has to be executed during the measurement, but David Mills recommends using NTP for high-precision synchronization, since the in-kernel implementation does not use floating-point calculations and is thus less accurate than a user mode NTP process. When using the solution with an NTP process, all nodes of the object system become stratum 1 NTP servers, as the GPS receiver appears to be locally installed by virtue of the PPS connection. When an Internet connection is used as the channel between the different nodes of the object system and the nodes are not placed in the same location, separate GPS receivers are needed at the different locations to generate the PPS signal for the time synchronization system. Since the PPS pulses of GPS receivers are derived from the global GPS clock ensemble, the error caused by the use of an additional receiver that possibly has other GPS satellites in view is minimal. Thus, this architecture can also be used for a geographically distributed object system. In any case, no modification of the standard software components is needed for this measurement infrastructure. Our new solution can be implemented using only a special configuration file.

5.1 PPS Pulse Latency

In Linux, the recognition of the PPS pulse is done by instructing the hardware to generate an interrupt in case of a signal transition on the DCD pin of the serial port. The interrupt handling routine in the serial port driver is modified by the patch to timestamp every invocation. The PPS API can generate an echo signal on the DSR pin of the serial port, which makes it possible to estimate the delay between the PPS pulse and the timestamping using an external clock.
This delay $d_{echo}$ is composed of the hardware propagation delay for the incoming PPS pulse $d_{hwi}$, the interrupt latency $d_{lat}$, a delay between the timestamping and the generation of the echo signal $d_{ts}$ and the hardware propagation delay for the outgoing echo signal $d_{hwo}$. While the other delays remain more or less constant and can be compensated for, $d_{lat}$ depends on the state of the system at the time of the signal reception. Thus, if the time of the generation of the $n$-th PPS pulse is $t(n)$, the time of timestamping this event is

$$t_{ts}(n) = t(n) + d_{hwi}(n) + d_{lat}(n)$$

and the time the echo pulse is observable as an external signal transition is

$$t_{echo}(n) = t(n) + d_{echo}(n) = t(n) + d_{hwi}(n) + d_{lat}(n) + d_{ts}(n) + d_{hwo}(n).$$

By recording the PPS pulse and the resulting echo with an external clock, the value of $d_{echo}$ can be determined. The time of the local clock at the $n$-th echo generation $t_{loc,echo}(n)$ is the timestamp generated by the PPS API driver. The time of the local clock at the $n$-th external PPS signal $t_{loc,pps}(n)$ is $t_{loc,echo}(n) - d_{loc,ts}(n)$, where $d_{loc,ts}(n)$ denotes the delay between the PPS signal and its timestamp. Thus the difference between two nodes $i$ and $k$ at the time of the $n$-th PPS pulse can be calculated as

$$\Delta t_{i,k}(n) = \left( t_{loc,echo,i}(n) - d_{loc,ts,i}(n) \right) - \left( t_{loc,echo,k}(n) - d_{loc,ts,k}(n) \right).$$

The delay $d_{loc,ts}$ is not observable, but since $d_{ts}(n)$ and $d_{hwo}(n)$ can be viewed as constant across different nodes and time, a reasonable approximation for $\Delta t_{i,k}(n)$ can be calculated as

$$\Delta t'_{i,k}(n) = \left( t_{loc,echo,i}(n) - d_{echo,i}(n) \right) - \left( t_{loc,echo,k}(n) - d_{echo,k}(n) \right) = \Delta t_{i,k}(n) + d_{ts,k}(n) - d_{ts,i}(n) + d_{hwo,k}(n) - d_{hwo,i}(n).$$

The assumption that $d_{hwo}$ is constant for all systems is justified because all signal level converters use the same hardware with low propagation delay and share the same ambient temperature. The generation of the echo signal inside an interrupt handler with other interrupts disabled and the identical hardware on all real servers make the assumption of a constant $d_{ts}$ reasonable as well. The PPS serial port driver is implemented so that the serial port remains usable for general communication besides PPS recognition. Because of this, several instructions are executed between the timestamping and the echo generation. We therefore decided to implement a driver for the parallel port for exclusive use for PPS signal recognition. This also enabled us to avoid the use of signal level converters, since the parallel port makes use of TTL signals as provided by the GPS receiver. Figure 5.3 shows a reduction in the interrupt latency with our driver using the parallel port compared to the standard PPS serial port driver.

Figure 5.3: Interrupt Latencies (panels: IRQ latency serial port, IRQ latency parallel port; density over latency [µs])

The PPS API patch for the 2.4 Linux kernel series also improves the resolution of the system call do_clock_gettime() to one nanosecond. When using this system call from kernel space, there is no context switch involved. The measured mean execution time of one system call on our cluster nodes is 70 ns. This measurement was done by allocating a buffer in the kernel space and writing the results of successive invocations of the system call do_clock_gettime() to that buffer space. The content was read by a user mode tool where we calculated the differences of successive timestamps. It shows that the times $d_{ts}(n)$ for generating the timestamps once the interrupt handler is invoked are short compared to the interrupt latencies $d_{lat}(n)$ with our parallel port driver.

Echo Feedback

The interrupt latency $d_{lat}$ from section 5.1 occurs not only in our setup, but in every system that uses an external reference clock. It makes no significant difference whether the clock is connected to a serial or parallel port or to a system bus like PCI. Interrupt latencies occur in any case. The calculation of $\Delta t'_{i,k}$ led us to the idea of Dynamic PPS Echo Feedback: By measuring the time between the PPS pulse and the generated echo for each pulse with an external clock, we can compensate for the interrupt latency by subtracting $d_{echo}(n)$ from the timestamp $t_{loc,echo}(n)$. The resulting timestamp

$$t'_{loc,echo}(n) = t_{loc,pps}(n) - d_{ts}(n) - d_{hwo}(n)$$

is lower than the desired $t_{loc,pps}$ by $d_{ts}(n) + d_{hwo}(n)$ but does not depend on the interrupt latency any more. Since the generation of the echo signal is done in the interrupt handler immediately after generating the timestamp, where other interrupt handling is disabled, the point of time where the signal is generated is close to the timestamping of the PPS pulse. Therefore, the delay $d_{echo}(n)$ is a close approximation for the interrupt latency, and the use of timestamps calculated as described above leads to a considerable improvement of the quality of the synchronization with respect to phase errors and jitter of the timestamps. This novel concept is used in [85] for the implementation of an improved synchronization system by Gükan Uygur. During the work on his thesis, he created an external clock that is intended to measure the interrupt latency. For this purpose, a field-programmable gate array (FPGA) is connected to the parallel port of the object system. The FPGA is programmed to increment an internal counter with every tick of an external quartz oscillator connected to the FPGA. The structure of the hardware is shown in figure 5.4.

Figure 5.4: External Clock

Each time the FPGA receives a PPS signal, the current reading of the counter is latched in an internal register and the counter is set to zero. When the FPGA receives an echo signal, the counter is saved to another register, but the counter is not reset; it keeps counting. The contents of both registers can be read by the object system over a parallel port connection. The interrupt handler in our own implementation of a PPS API driver for the parallel port was modified so that after generating a timestamp for each PPS signal received and generating an echo signal, it reads the values of both FPGA registers. It then uses the counter value $c_{pps}(n)$

latched for the PPS signal as the frequency of the external oscillator, since the time between two PPS signals is exactly one second and thus this counter value is the exact frequency of the oscillator during the previous second. Once the frequency is known, the time between the reception of the PPS and the echo signal $d_{echo}(n)$ can be estimated using the counter value for the echo signal $c_{echo}(n)$ as

$$d_{echo}(n) = \frac{c_{echo}(n)}{c_{pps}(n+1)}\,\mathrm{s} \approx \hat{d}_{echo}(n) = \frac{c_{echo}(n)}{c_{pps}(n)}\,\mathrm{s}.$$

The frequency of the external oscillator during the last second is used as an approximation of the current frequency. The error introduced is small, since the frequency changes are also small during the short measurement interval $\tau_0 = 1$ s. The recorded timestamp for the PPS pulse $t_{loc,echo}(n)$ can then be modified by subtracting $\hat{d}_{echo}(n)$. This modified timestamp

$$\hat{t}_{loc,pps} = t_{loc,echo} - \hat{d}_{echo}(n)$$

is provided for later use in applications or the kernel hardpps facility through the API as an estimate for the system clock at the time of the PPS pulse. Please note that another error has been introduced by measuring $\hat{d}_{echo}(n)$ with the external clock, whereas $t_{loc,echo}$ is measured using the internal clock of the object system. The error caused by this compensation is only in the range of a hundred parts per million of $\hat{d}_{echo}(n)$ for an undisciplined local clock. It gets close to zero as the frequency of the clock of the object system is gradually disciplined by the PPS pulses and the frequency of the external clock is also determined using these signals. For optimal performance, the granularity of the external clock should be at least as fine as the granularity of the object system. For a resolution of one nanosecond, an external quartz oscillator with a frequency of 1 GHz and an FPGA that can handle this frequency would be needed. As this is not feasible, the experiments were conducted using a 50 MHz oscillator.
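The correction step can be summarized in a short Python sketch. It is an illustration of the two formulas above only, not the interrupt handler code from [85]; the function names and the use of floating-point seconds for the local timestamp are assumptions made for this example.

```python
def estimate_echo_delay(c_echo, c_pps_prev):
    """Estimate of d_echo(n) in seconds.

    c_echo     -- counter value latched at the echo signal
    c_pps_prev -- counter value latched at the PPS signal, i.e. the number
                  of oscillator ticks during the previous second
    """
    if c_pps_prev <= 0:
        raise ValueError("oscillator frequency must be positive")
    # Ticks since the PPS pulse divided by ticks per second.
    return c_echo / c_pps_prev


def corrected_pps_timestamp(t_loc_echo, c_echo, c_pps_prev):
    """Local timestamp with the estimated echo delay subtracted."""
    return t_loc_echo - estimate_echo_delay(c_echo, c_pps_prev)
```

With a 50 MHz oscillator, one counter tick corresponds to 20 ns, so a latched echo count of 250 ticks translates into an estimated delay of 5 µs.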
With this setup, we were able to achieve large improvements in timekeeping. For a typical trace, the root mean square (RMS) value of the jitter of the PPS timestamps was reduced from ns to 5414 ns. The jitter has been measured as the difference of successive differences of PPS timestamps: when $t_{loc,pps}(n)$ denotes the timestamp for PPS pulse $n$, the difference to the next timestamp $n + 1$ can be calculated as $\Delta(n) = t_{loc,pps}(n + 1) - t_{loc,pps}(n)$. The jitter is then calculated as $j(n) = \Delta(n + 1) - \Delta(n)$. While the number of large jitter values decreases considerably when applying the echo feedback mechanism, the number of small jitter values increases. This is caused by the limited granularity of the external clock.
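The jitter definition above amounts to a second difference of the timestamp series. The following Python sketch computes it together with the RMS value; the function names are our own and the timestamps in the example are made up for illustration.

```python
import math


def pps_jitter(timestamps):
    """Return the second differences j(n) of a sequence of PPS timestamps."""
    # First differences: time between successive PPS timestamps.
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    # Second differences: change between successive first differences.
    return [b - a for a, b in zip(deltas, deltas[1:])]


def rms(values):
    """Root mean square of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("need at least one value")
    return math.sqrt(sum(v * v for v in values) / len(values))


if __name__ == "__main__":
    # Hypothetical timestamps in nanoseconds (integers avoid rounding noise).
    stamps_ns = [0, 1_000_000_000, 2_000_000_010, 3_000_000_010]
    print(pps_jitter(stamps_ns), rms(pps_jitter(stamps_ns)))
```

Note that at least three timestamps are needed to obtain a single jitter value.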

Figure 5.5: Time Deviation (time deviation $\sigma_x(\tau)$ [s] versus averaging interval $\tau$ [s]; series: PPS, ECHO Feedback)

More detailed evaluations involved calculating the time deviation. The time deviation $\sigma_x(\tau)$ has been introduced in an earlier section. It is an estimator for the time dispersion due to frequency variation. Figure 5.5 was produced by plotting, versus the averaging interval $\tau$, the time deviation for the raw PPS pulses that were received by the PPS API with an undisciplined local clock as crosses and for the PPS pulses corrected by dynamic PPS echo feedback as dots. Both axes are scaled logarithmically. The measurement process took 24 hours. The frequency of the local clock had been corrected to eliminate the systematic frequency errors as far as possible by determining the overall frequency over a long averaging period of several weeks.

The graph shows that using echo feedback reduces the phase errors, especially for small averaging periods. The values of $\sigma_x(\tau)$ are always below the values for uncorrected PPS pulses. Please note that both axes are scaled logarithmically by convention to make it possible to identify the different noise processes. Therefore, the effect looks smaller at first glance than it really is. The graph also shows that a careful selection of the averaging interval $\tau$ is crucial to the accuracy of the system. Our measurements imply an optimum value of 32 seconds, but a standard NTP daemon bases the choice of $\tau$ on the Allan intercept point, the minimum of the Allan variance. The averaging interval used by NTP

is larger than the optimal choice determined in our setup, since standard NTP and hardpps implementations impose a lower limit of 1024 seconds on $\tau$. This limitation led to the idea of implementing our own synchronization system that is tailored to the specific needs of our laboratory environment.

5.2 Offline Synchronization

To obtain an optimal synchronization, we developed a solution that uses the TSC of the CPU to timestamp both the events and the PPS pulses and leaves the system clocks completely unsynchronized. The synchronization is done offline after the measurements have taken place. As explained in section 4.1, the system clock is not based on a free-running oscillator in most operating systems, but on a combination of a number of clock sources (e.g. interrupt controller and cycle counter). This leads to an addition of the individual noise processes of the different oscillators and makes it harder to synchronize the clock to an external reference and to determine optimal parameters for a synchronization system. This can be completely avoided by using only a single free-running high-frequency oscillator. In this case we use the cycle counter that is triggered by the internal CPU clock. Another advantage of using the cycle counter for timestamping is shown in figure 4.5: it can be read fast and no context switch is needed. Therefore, the measurement process influences the performance of the object system less than when using the local clock for generating timestamps.

Figure 5.6 illustrates the measurement and offline synchronization method. Before the measurement starts, a reference point in time is marked with a TSC timestamp of the object system. This reference timestamp is used to generate absolute timing references. During the measurement, both the events and the PPS pulses arriving at the object system are timestamped using the object system's cycle counter and these timestamps are recorded in trace files.
After the measurement took place, the time trace of the PPS pulses can be used together with the initialization information to calculate a synchronized event trace that contains absolute time points for all entries of the original event trace. Frank Fischer [24] implemented a solution using an exponentially weighted moving average algorithm with weights chosen depending on the optimum averaging period. He showed that the mean accuracy of the synchronization achievable is

603 ns, which is very close to the specified accuracy of the PPS signal used in the setup (500 ns).

Figure 5.6: Offline Synchronization

The solution is based on the lockclock algorithm [46]. Assume the time trace contains a number of readings of the local clock $t_k$ and the corresponding time $t_{R,k}$ of a reference clock. The time offset $x_k$ of the local clock with respect to the reference time is given by $x_k = t_k - t_{R,k}$. When $\tau_k = t_{R,k} - t_{R,k-1}$ defines the current difference of the timestamps as measured by the reference clock, the current frequency error $y_k$ can be estimated as $y_k = \frac{x_k - x_{k-1}}{\tau_k}$. The lockclock algorithm tries to estimate the current time offset $\hat{x}_k$ from the filtered previous frequency error estimate $\bar{y}_{k-1}$ as $\hat{x}_k = x_{k-1} + \bar{y}_{k-1} \tau_k$,

where $\bar{y}_k = \frac{\bar{y}_{k-1} + G y_k}{1 + G}$ with a weighting factor $G$ that is determined by the characteristics of the local clock. Defining $\alpha = \frac{G}{1 + G}$ allows us to write the equation above as $\bar{y}_k = \alpha y_k + (1 - \alpha) \bar{y}_{k-1}$, which is an exponentially weighted moving average of the calculated $y_i$ with a weighting factor $\alpha$.

To apply this algorithm to our situation, where PPS pulses are used as the main synchronization source and the time between successive reference timestamps is exactly $\tau_0 = 1$ s, a sensible weighting factor $G$ has to be determined. Levine reasons in [46] that $G$ depends on the characteristics of the free-running clock. He suggests determining the measurement interval $T_{nw}$ at which the frequency fluctuations of the free-running clock begin to deviate from a white spectrum, as white frequency noise leads to the best predictions by the algorithm. This can be done by finding the point in a plot of the logarithm of the Allan deviation $\sigma_y$ versus the logarithm of the measurement interval $\tau$ where the slope changes from $-0.5$ to $0$. An optimized weighting factor $G$ should then be selected so that $G \approx \frac{\tau_0}{T_{nw}}$.

This algorithm was implemented as a Java application. The program takes a text file that contains TSC stamps for PPS pulses and an initialization text file that contains a wall clock time (date and time) for one specific PPS pulse. From a trace file with TSC stamps for events and the corresponding event identifiers, it generates an event trace with wall clock times and event identifiers. Since the events are synchronized to an external clock that provides the PPS pulses, the solution can be applied to an arbitrary number of event traces of different systems. The resulting synchronized traces can then be compared to each other and used, for example, to determine one-way delays.
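The recursion given above can be sketched in a few lines of Python. This is not the Java application mentioned in the text but a minimal illustration of the equations; the function name is hypothetical, and seeding the filtered frequency error with the first measured value is an assumption of this sketch, since the text does not state how the filter is initialized.

```python
def lockclock_predict(t_local, t_ref, G):
    """Apply the lockclock recursion to paired clock readings.

    t_local and t_ref are equally long sequences of local and reference
    clock readings. Returns a list of (predicted offset, measured offset)
    pairs, starting with the first sample for which a filtered frequency
    error estimate is available.
    """
    if G < 0:
        raise ValueError("G must be non-negative")
    alpha = G / (1 + G)
    # Time offsets x_k = t_k - t_R,k of the local clock.
    x = [tl - tr for tl, tr in zip(t_local, t_ref)]
    y_bar = None  # filtered frequency error estimate
    results = []
    for k in range(1, len(x)):
        tau = t_ref[k] - t_ref[k - 1]
        if y_bar is not None:
            # Prediction from the previous offset and filtered frequency error.
            x_hat = x[k - 1] + y_bar * tau
            results.append((x_hat, x[k]))
        y = (x[k] - x[k - 1]) / tau  # current frequency error estimate
        # Assumption: the filter is seeded with the first measured y.
        y_bar = y if y_bar is None else alpha * y + (1 - alpha) * y_bar
    return results


if __name__ == "__main__":
    # Hypothetical clock with a constant frequency error, reference at 1 s steps.
    ref = [0.0, 1.0, 2.0, 3.0, 4.0]
    local = [0.0, 1.5, 3.0, 4.5, 6.0]
    for predicted, measured in lockclock_predict(local, ref, G=0.25):
        print(predicted, measured)
```

For a clock with a constant frequency error, as in the example, the predicted and measured offsets coincide regardless of $G$; the choice of $G$ only matters once the frequency error fluctuates.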

This approach is also applicable in small embedded systems where an online synchronization would be too time consuming [39]. When using configurable hardware, it is even possible to latch the current cycle counter (TSC) reading in hardware at every PPS pulse. This latched cycle counter can be read in the interrupt service routine for the PPS pulse to be used in an offline synchronization process, as this completely avoids the negative impact of the interrupt latency.

The system has not only been used for the web cluster; it has also been applied to measure one-way delays for wireless IEEE 802.11b transmissions. For this purpose, laptop computers were equipped with PCMCIA WLAN cards. A PPS pulse from a GPS receiver was delivered to the parallel ports of the computers to record the PPS TSC trace. The whole measurement process has been implemented as an integrated system by Christian Resch [73, 72]. The measurement results were then used to calibrate existing WLAN models in the simulation package ns-2.

Johannes Dodenhoff [20] evaluated an implementation of the NTP algorithms with a VFO and a PLL/FLL for offline analysis of timestamps. He was unable to produce results that were better than the results of the offline synchronization process of Frank Fischer. Using the PLL approach with a standard NTP parameter set, the 40 minute cycle still remained clearly visible as an oscillation in the time offset. Using optimized parameters ($\tau = 1024$ s), the amplitude of this oscillation was reduced to about 5 µs. He obtained the best results by applying an FLL correction after the trace had already been modified using the PLL approach. The main result was that the quality of the synchronization depends strongly on the chosen set of parameters, such as the bandwidth $\tau$ of the PLL. When the parameters are not carefully chosen, the timestamps begin to oscillate and diverge more and more from the reference clock.
The experiments with recorded time traces showed that the offline synchronization implemented by Frank Fischer provided sufficiently good results and that a PLL or FLL approach needs to be tailored to the specific characteristics of the clocks. For a future improvement of the offline time synchronization process, it seems worthwhile to evaluate the performance of a Kalman filter based approach.

5.3 Instrumentation

For a software monitoring solution, it is necessary to instrument the code of the software that is executed on the object system. Instrumentation is done by inserting instructions into the code that generate timestamps for events and record them together with an event identifier in an event trace. Events can also mark the beginning and the end of an activity. It is therefore possible to calculate the durations of activities that influence the progress of a certain task. Since all the software components used in our laboratory are open source applications, it has been possible to include instrumentation code in the source code and to recompile all necessary components.

IP Stack Instrumentation

Since a primary goal was to include fine-grained measurements of one-way delays for IP packets, it has been necessary to instrument the TCP/IP stack of the operating system kernel. By doing so, it is possible to generate timestamps as soon as the operating system recognizes an IP packet. The closer to the hardware the timestamp is generated, the fewer influences of other tasks disturb the measurements of the packet delays. This proved feasible because we use Linux as the operating system, which enabled us to include our own code in the kernel and to recompile and use our own custom kernel.

Starting with version 2.4, the Linux TCP/IP stack contains the netfilter framework [65] for packet filtering and mangling. It provides hooks at several places in the kernel stack where custom code blocks can be registered. These blocks are executed when an incoming or outgoing IP packet is processed by the kernel stack. Our first solution for Linux version 2.4 was implemented by Andrey Chepurko [15] and consisted of timestamping code that was registered at a hook (e.g. NF_IP_LOCAL_OUT on the real server nodes) of the netfilter framework and a kernel space ring buffer for the recorded timestamps and event identifiers.
The event identifiers included the source IP address, the TCP source port and the TCP sequence number for incoming packets from the client. For outgoing packets, destination data has been used instead of source data. Another field of the event ID included the TCP flags SYN, ACK, FIN and the direction of the packet (incoming or outgoing). Special care had to be taken on the load balancing node, since

both packets sent from the client and packets destined to the client pass this node twice. Therefore, each packet has to generate two different events, one when entering the node and one when leaving the node. Since our load balancing solution is implemented as a kernel module that adds load balancing functionality to the IP stack, this code was extended to also include timestamping and event recording.

The timestamps have been generated using the do_clock_gettime() call. Therefore, we have been able to obtain timestamps with nanosecond resolution when using the PPS API kernel patch. With these 64-bit timestamps, one entry has occupied 19 bytes of buffer space. The size of the kernel event trace buffer can be configured; the standard size has been 12 MByte. This buffer has been organized as a ring buffer: when the buffer is full, the measurement continues and successively overwrites the oldest entries.

The buffer can be read from user space processes via a character device. The recorded event trace is copied to user mode in raw binary form. IOCTL commands to reset and clear the buffer have been implemented. A user mode process can read from the device both during and after the measurement and write the trace to a file in binary form. An additional program has been implemented to convert the entries of the binary file to a text file for further evaluation.

While this first solution was sufficient for experiments in the web cluster laboratory, it had a number of limitations. The most important problem was that it has only been implemented for Linux kernel version 2.4. Furthermore, even though it is a loadable module, it has been static in many respects: most options, like the size of the kernel ring buffer, can only be changed at compile time, not when loading the module.
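The overwrite behavior of the kernel ring buffer described above can be illustrated with a small user-level Python sketch. It is only a model of the data structure, not the kernel code; the class and method names are hypothetical, and entries are simplified to (timestamp, event identifier) pairs.

```python
class EventRingBuffer:
    """Fixed-capacity event buffer that overwrites its oldest entries."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._next = 0   # index of the slot written next
        self._count = 0  # number of valid entries

    def record(self, timestamp_ns, event_id):
        """Store one event; once full, the oldest entry is overwritten."""
        self._slots[self._next] = (timestamp_ns, event_id)
        self._next = (self._next + 1) % len(self._slots)
        self._count = min(self._count + 1, len(self._slots))

    def clear(self):
        """Discard all entries (comparable to a reset command)."""
        self._next = 0
        self._count = 0

    def drain(self):
        """Return all entries, oldest first, and clear the buffer."""
        capacity = len(self._slots)
        start = (self._next - self._count) % capacity
        entries = [self._slots[(start + i) % capacity]
                   for i in range(self._count)]
        self.clear()
        return entries


if __name__ == "__main__":
    buf = EventRingBuffer(capacity=3)
    for ts, ev in [(1, "a"), (2, "b"), (3, "c"), (4, "d")]:
        buf.record(ts, ev)
    print(buf.drain())  # the entry (1, "a") has been overwritten
```

At 19 bytes per entry, a 12 MByte buffer holds roughly 630,000 to 660,000 entries, depending on whether MByte is read as 10^6 or 2^20 bytes.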
What limited its use in other fields of application was the fact that both the packets that are captured and the header fields that are included in the event trace have been fixed and tailored to the specific needs of web traffic. This has been done to limit the size needed for an event entry and to increase the speed of the event recording. But as main memory grows, and since a flexible instrumented IP stack proved useful for other applications, we decided to re-implement the netfilter-based capture solution for Linux version 2.6 in a more flexible way.

Mario Lasch [42] created a flexible packet logging solution for Linux kernel version 2.4 and provided a base for porting his solution to kernel version 2.6. His implementation kept the basic realization of the in-kernel ring buffer and the character device for communication with user mode processes. His logging module can be attached to any netfilter hook and allows specifying which packets to capture depending on their protocol (TCP, UDP or ICMP) and destination port. In contrast

to the previous implementation, the timestamps can be generated not only using the kernel clock when a packet is received, but also using the TSC or by reading the timestamp that is generated by some network interface card drivers. This brings the timestamping closer to the hardware and thus provides more precise results. Another improvement is that the complete IP and transport layer headers can be recorded. The size of the kernel ring buffer can be specified when loading the module without recompilation. Nonetheless, a change of the netfilter hook still requires a recompilation of the module.

Figure 5.7: IP Stack Instrumentation

In a further thesis [43], Mario Lasch implemented a solution based on his prior work. The extended module is usable for both Linux kernel versions 2.4 and 2.6. It is completely integrated into the netfilter framework. Besides being attachable to any netfilter hook without recompilation, the kernel module can also be used as an iptables target. The netfilter framework provides the possibility to implement rule sets for packet filtering. The rules are composed of a number of classifiers (iptables matches) and one connected action (iptables target). The user mode command iptables can be used to insert a new match in a netfilter kernel table and defines

the target for matching packets. The following command inserts a rule that drops all incoming TCP packets in which, of all TCP flags, exactly the SYN and ACK bits are set, i.e. sends them to the target DROP that discards the packets:

    # iptables -A INPUT -p tcp --tcp-flags ALL SYN,ACK -j DROP

The new logging solution provides an iptables target named RBUFF. Timestamps are created for all packets that are sent to this target, and the timestamp and corresponding packet header data are recorded in the kernel ring buffer as an event entry. After loading the new module called ipt_rbuff.ko, the following command can be used to trigger the recording of all incoming TCP packets with destination port 80 in the event trace:

    # iptables -A INPUT -p tcp --dport 80 -j RBUFF

In contrast to the DROP target, packets sent to the RBUFF target are not discarded, but remain in the kernel to be processed by other rules and to be finally copied to user mode applications. In addition to the hook and target modes, the module can also be used in a dual mode where the packet is recorded both in the first and the last hook of the IP stack (PRE_ROUTING and POST_ROUTING hook). This mode is needed for measuring the time spent in the stack, which is useful in gateway nodes like the load balancer of the web cluster.

The use of multi-core architectures and kernel preemption in Linux version 2.6 made it necessary to include protection for critical sections in the new version of the module. These new kernel versions provided the base for some further improvements. The module can now be monitored and controlled using the sys filesystem (sysfs). For example, when using sysfs, the following command can be used to stop the recording of events in the kernel buffer:

    # echo "0" > /sys/bus/platform/drivers/rbuff_driver/record

Since kernel versions 2.6 provide advanced solutions for transferring data from the kernel to user space applications (e.g.
the relay subsystem), the character device implementation of the module has also been analyzed and optimized. Likewise, the module now also supports udev, an approach to create device nodes in the /dev tree automatically. Additionally, Mario Lasch implemented a new user mode tool to read the event entries from the kernel ring buffer. In addition to the export of the entries into a text file where the fields are separated by spaces or commas, the entries can also be exported in libpcap format. This format is used by the

tcpdump command and allows network protocol analyzers like wireshark [16] to read the recorded trace. The filtering and display capabilities of wireshark are especially useful when examining and debugging new configurations. The architecture of the current IP stack instrumentation is depicted in figure 5.7.

Web Server Instrumentation

In addition to the kernel level timestamping of IP packets, an instrumentation of the code of the web server application is needed to obtain data for modeling and performance evaluation. When static web pages are served, an instrumentation of the Apache web server has been used to generate application level timestamps. Apache's C API provided a base to implement handlers in certain stages of the processing of an incoming HTTP request. The first handler is the post-read request handler. This handler is called when an arriving request has been fully read by the server application. We implemented an external Apache module that registers a handler in this place to write a timestamp along with the client IP address, the client port and the URI of the request to an event trace file. The event for the completion of the request is triggered when the ap_send_http_header() function is invoked. This point in time marks the beginning of sending the HTTP reply back to the client.

Load Generator Instrumentation

On the client side, an HTTP load generator is needed. httperf, a load generator developed by David Mosberger [64], is able to generate load with different characteristics: sequential requests, requests with a fixed rate, session-oriented traffic with think times and requests according to a recorded trace file. It has the potential to overload a web server by generating a high request rate. Unlike most other load generators, it does not try to simulate a certain number of users. The number of requests and the rate of the requests can be specified on the command line.
Since the test client PC has certain limitations on the number of TCP connections that can be held open simultaneously, httperf supports parallel execution on a number of client machines. This load generator is ideal for studying the behavior of a web server in extreme situations in order to find the system's limits.

SURGE [6], on the other hand, emulates the behavior of a configurable number of users as observed by its author Paul Barford in an analysis of web server log files. For this purpose, the relative percentage of the number of accesses per file, embedded references, temporal locality of references and inactive periods of the user are determined by an analytical model derived from empirical observations. Some of the probability density functions used in the model are heavy-tailed. The on-off processes which are used to model the user generate bursts and self-similar traffic as observed in recent studies about real-world traffic on the Internet. SURGE is also usable in a distributed environment with more than one load generating node.

We use both load generators and instrumented their request generation phase for timestamping each request on the HTTP layer. In addition to the instrumentation of the load generating software, the IP stack of the load generator nodes has also been instrumented with the solution from the section on IP stack instrumentation.

Application Server Instrumentation

While static content is still common on most web server systems, generating content dynamically becomes more and more important in the web. In its most basic form, data in an internal representation is transformed to another output format for presentation to the client. This makes it possible to separate the layout of the web pages from the content. One approach to achieve this is the representation of the content as XML and the use of XSL for transformation to an output format like HTML. We evaluated the use of XSLT in our web cluster laboratory in [96]. This dynamic generation of content can be combined with a content management system (CMS) as described in [23]. A content management system often uses a combination of a database for storing the content and an application server for the transformation. Even more dynamic behavior can be expected in a web shop application.
Therefore, Markus Preißner [70] implemented an online bookstore according to the TPC-W benchmark [83] for use in our lab. He used Enterprise Java Beans (EJB) for the business logic running on a JBoss application server. The database backend was a MySQL server. His implementation was intended for use on the web cluster and made it possible to distribute the application server and database functionality to a different number of nodes, depending on the configuration used. The only limitation in doing so was that write access was only allowed on one of the multiple database

nodes. For a performance evaluation of a system of this kind, it is necessary to instrument the different stages of the request handling and reply generation on the application server.

Patrick Wunderlich [97] implemented an instrumentation for the TPC-W web shop system using aspect-oriented programming (AOP). AOP makes it possible to define an instrumentation aspect, where parts of code that are used for the instrumentation can be kept separate from the business logic. These code parts are called advices. Joinpoints are events in the program execution like the invocation of a method. A pointcut selects specific joinpoints and assigns an advice to them. An advice can then be executed before, after or instead of a method. The process of weaving constructs the complete software system using the core logic and inserting the advices of different aspects according to the defined pointcuts.

When using Java as in our case, the weaving of the instrumentation aspect can be done in three different ways:

- A precompiler can be used to insert the source code of the advices into the source code of the program to be instrumented. The resulting code can then be compiled to bytecode using a standard Java compiler and be executed using a standard Java virtual machine (JVM).
- An AOP compiler can be used to insert the advices into the compiled bytecode of the core logic to produce instrumented bytecode that can be executed like the uninstrumented code.
- A special class loader can be used during the execution of the uninstrumented bytecode of the core logic to insert the code of the advices.

Patrick Wunderlich used AspectWerkz for his instrumentation. Aspects are implemented in pure Java. The pointcuts can be defined using annotations in Java 5, custom doclets in Java 1.3 and 1.4 or an external XML definition. The joinpoint definitions can contain wildcards that match specific classes, methods, constructors or fields.
Pointcut definitions can be combined using logical expressions like NOT, OR, AND and can be grouped using parentheses. This allows a flexible selection of the method invocations to be included in the performance evaluation. As the measurement code is inserted by weaving, the original web shop source code remains completely unmodified. Advices were introduced not only into the web shop software on the JBoss application server, but also into the Tomcat servlet container and the Clustered-JDBC database middleware. This middleware component was also utilized by Patrick Wunderlich to mitigate some limitations of the original implementation of Markus Preißner when it is used in a clustered

environment. The event traces of all components on all nodes are sent to a central instrumentation server. This server is also implemented in Java and based on the Tomcat servlet container. It collects the traces and allows the user to filter interesting events using a web frontend. For statistical analyses, the relevant data can then be exported to text files. Figure 5.8 shows the overall architecture of the instrumentation present on each cluster node.

Figure 5.8: Application Server Instrumentation Architecture

The solution not only provided valuable data for a performance analysis of the system, but also made it possible to gain insight into the dynamics of the implementation and to identify optimization points. While this method of instrumentation has been developed for a system where all source code is available, it can also be used for closed source systems. As an example, Patrick Wunderlich explained in his thesis how the database interaction can be observed by instrumenting the JDBC driver. Since each driver has to implement the interfaces of the package java.sql, it is obvious which pointcut definitions have to be used.

Stefan Schreieck showed the applicability of this approach to a self-service web portal of the University of Applied Sciences in Kempten. The system uses commercial Java class files for which no source code is available. Therefore, Stefan Schreieck used the method we suggested and created an instrumented JDBC driver [75, 76]

for the Informix database to obtain performance data of the system that can be used to parametrize models of the system like the ones presented in [29].

A similar instrumentation was used by Olena Kolisnichenko [40] for the web portal of DATEV eG. DATEV is a Nuremberg-based association for tax counselors, auditors and attorneys. Their online portal forms the gateway to different online services provided to their customers. It is a Java 2 Enterprise Edition application that has been programmed by an in-house development department. To identify performance-critical parts in the program execution and to avoid possible problems, a measurement method that can be used during the implementation and testing phases before deployment had to be found. One goal was to keep the instrumentation separate from the core business logic and to provide a solution that makes it easy to add and remove the instrumentation. The performance impact introduced by the instrumentation should be kept to a minimum. Previous experiments have shown that traditional profiling was not flexible enough for the automation of the performance tests and had a major impact on the overall performance. Our aspect-oriented approach, which has been ported to the DATEV application by Olena Kolisnichenko, proved to fulfill the needs of the development department and will be used during future tests of new implementations.

Summary Performance Data

In addition to the fine-grained performance data obtained by event-oriented software monitoring, summary performance data are often useful to validate the outcome of simulation runs. In a detailed performance analysis, resources are often modeled as a separate entity. The utilization of these resources is not only caused by the application to be evaluated; other system activities also use the same resources. Therefore, tools to measure summary data like resource utilization are a sensible addition.
We used the sar and iostat tools that are part of the sysstat utilities [26]. In the cluster lab we used sar to sample the utilization of the CPU, the memory usage and the network load over intervals of one second. The CPU utilization in particular has proved to be an important indicator for the correctness of the model, because it allowed us to compare simulated and measured CPU data. When building a simple (conceptual) model, such summary performance data can be sufficient for parametrization.

5.4 Analysis of the Traces

While some of the performance data obtained by conducting measurements with our instrumentation can be used directly for the input modeling, the event traces, especially from the instrumented IP stack, have to be processed to be usable. As described in chapter 4, the event trace produced during event-oriented performance evaluation contains timestamps for events. The events themselves have no duration. The one-way delays that are represented in performance models are activities. The start and the end of each activity is marked by an event. The duration of the activity and thus the delay can be calculated as the difference of the corresponding timestamps.

In a distributed system, the start and the end are often recorded on different nodes of the object system in different trace files. As the timestamps are obtained from the local clocks of the object system, the start and the end of an activity are often measured with different clocks. Since the calculated delays are used for determining a distribution function in the input modeling process, the local clocks of the machines need to be synchronized with high accuracy. Low phase jitter in time synchronization is a crucial point. As described before, this can be achieved by synchronization during the measurements or by post-processing the event traces using offline synchronization (section 5.2).

Once the event traces of the IP stack with synchronized timestamps have been obtained from all nodes of the object system, the event identifiers can be used to reconstruct the path of all TCP packets through the nodes of the system. We wrote a Java application that uses the TCP sequence numbers, flags, ports and IP addresses to identify and calculate the delays each segment of a TCP connection experiences in different places.
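The matching step can be sketched as follows. This Python fragment is not the Java application mentioned above; it only illustrates the principle under simplifying assumptions: both traces are already synchronized, each entry is a (timestamp, event identifier) pair, the identifier is a hypothetical tuple of IP address, port, sequence number and flags, and only the first occurrence of an identifier in each trace is considered, so retransmissions are ignored.

```python
def one_way_delays(start_trace, end_trace):
    """Match events of two synchronized traces and return their delays.

    Both traces are iterables of (timestamp_ns, event_id) pairs. The result
    maps each event_id found in both traces to the difference between its
    timestamp in end_trace and its timestamp in start_trace.
    """
    started = {}
    for timestamp, event_id in start_trace:
        # Keep only the first occurrence of each identifier.
        started.setdefault(event_id, timestamp)

    delays = {}
    for timestamp, event_id in end_trace:
        if event_id in started and event_id not in delays:
            delays[event_id] = timestamp - started[event_id]
    return delays


if __name__ == "__main__":
    # Hypothetical identifiers: (client IP, client port, sequence number, flags).
    syn = ("10.0.0.1", 4000, 1, "SYN")
    ack = ("10.0.0.1", 4000, 2, "ACK")
    sender = [(100, syn), (250, ack)]
    receiver = [(180, syn), (400, ack)]
    print(one_way_delays(sender, receiver))
```

Events that appear in only one of the traces, for example packets that were lost or overwritten in the ring buffer, are silently skipped in this sketch.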
For example, this analysis makes it possible to determine the delay between the reception of an HTTP request and the sending of the first reply packet on a web server node, or the delay in the network channels. The end-to-end delay on the application level between a client and a server node can be evaluated by looking at the application layer traces of the respective nodes. For a web server system like the one in our lab, this means relating the load generator and the web server traces to each other, which requires matching the request URL, the client IP address and the client port.

During the implementation of the detailed simulation model, it became obvious that the model can produce event traces that contain exactly the same information as those obtained during the measurements. In the simulation model, the dynamics of the system are often represented as state charts. When a change of the state of a model can be seen in the output of the system, the captured output of a real system can be used in combination with the model of this system to parametrize the model automatically. So far, we have not implemented this approach. For model-based performance testing, we have identified some aspects that can be handled using this method [7].

5.5 Example Measurement Results

For these example measurements, five real server nodes and one load balancer were used in a NAT environment with round-robin scheduling. One test client generated HTTP requests using the httperf load generator. We generated 10,000 HTTP/1.0 requests for a binary file with a size of 1,024 bytes. This resulted in a request size of 65 bytes. The web server added 244 bytes of header information, so the resulting replies had a size of 1,268 bytes. Since this is smaller than the maximum segment size used (1,500 bytes), all replies consisted of exactly one TCP segment.

Figure 5.9 illustrates the 27 individual delays in the exchange of TCP segments that contribute to the total processing time of the HTTP request. Time advances along the vertical axis from top to bottom, and the vertical bars represent the different delays. The horizontal position shows where the delays are caused: by the load generator (LG), the network channel between the load generator and the load balancer (C1), the load balancer (LB), the network between the load balancer and the real servers (C2), or one of the real servers (RS). The delays in the channels C1 and C2 include not only the physical propagation delay and the processing delay in the switches between the hosts, but also the time between the reception of the packet at the node of the cluster and the beginning of the packet processing in the TCP/IP stack of the operating system.
The segments that appear during delays 11, 12 and 16 are sent due to TCP protocol mechanisms (Early ACK) and do not mark a state change in the HTTP protocol state machine.

The measurements were conducted in a low-load situation, so queuing was not an issue. One reason for doing so was that this leads to lower delays for each packet and separates the different delays so that the activities do not overlap and influence each other. Another reason was that the load generator is not powerful enough to drive the cluster system into an overload situation. Although both the measurement infrastructure and the load generation software allow the use of more than one load generator, this would lead to more complicated traces that are hard both to analyze and to visualize.

All data shown in this section were obtained using our instrumentation of the IP stack. The ring buffers were configured to be large enough to hold all captured data, so that the user-mode reading application needed to be started only once the measurement was over. Therefore, it did not influence or disturb the measurements. Time synchronization was achieved using the PPS signals of a GPS receiver connected to all cluster nodes and the load generator, as described earlier in this chapter.

A trace plot of all delays can be seen in figure 5.10. All delays are plotted over the time of their measurement in one single graph. Since this graph provides only a comparison of the different orders of magnitude of the delays and an impression of the complex nature of the processes, figure 5.11 shows trace plots of the individual delays. The vertical axes have been limited to the 99.5% quantile of the respective delay. The maximum values are not included, because outlier values can become excessively high; the main part of the observed delays would otherwise have been reduced to a single line due to the scaling of the axes.

Figure 5.12 shows the 27 different delays for the first 50 request-reply pairs from the trace. The delays are displayed as stacked horizontal bars from left to right. The colors of the bars correspond to the colors of the delays in figure 5.9. Therefore, the bars on the left show delay 1, while the rightmost bars represent delay 27.
Overall summary statistics are plotted in figure 5.13. The horizontal stacked bars show the minimum, the 0.5% quantile, the first quartile, the mean value, the median, the third quartile, the 99.5% quantile and the maximum value for all 27 delays observed. It is easy to see that the largest 0.5% of the values are much larger than the rest. This indicates that a few measurements are disturbed by undesirable side effects, and it justifies discarding these values in further analyses of the system. Since the delays differ by three orders of magnitude, some of the delays cannot be seen in this plot. Table 5.1 summarizes the same statistics in numerical form. All values are given in microseconds.
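The statistics listed above can be computed per delay with a few lines of code. The following is a minimal sketch, not the analysis used for table 5.1: the function names are hypothetical, and the quantile convention (linear interpolation between order statistics) is an assumption, since the text does not state which definition was used.

```python
def quantile(sorted_values, p):
    """Quantile by linear interpolation between order statistics (assumed convention)."""
    if not sorted_values:
        raise ValueError("no data")
    position = p * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def delay_summary(delays):
    """Return the eight summary statistics for one delay (same unit as the input)."""
    ordered = sorted(delays)
    return {
        "min": ordered[0],
        "q0.5": quantile(ordered, 0.005),
        "q25": quantile(ordered, 0.25),
        "median": quantile(ordered, 0.5),
        "q75": quantile(ordered, 0.75),
        "q99.5": quantile(ordered, 0.995),
        "max": ordered[-1],
        "mean": sum(ordered) / len(ordered),
    }
```

Comparing the `q99.5` and `max` entries of the result is a quick way to see how far the largest values lie from the bulk of the data.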

Table 5.1: Quantile Summary for Delays in Microseconds (columns: Min., 0.5%, 25%, Median, 75%, 99.5%, Max., Mean; rows: Delay 1 to Delay 27)

Figure 5.9: Illustration of Delays in the Object System

Figure 5.10: Trace Plot of Measured Delays

Figure 5.11: Trace Plots of Individual Delays

Figure 5.12: Delay Components for Requests (horizontal axis: Delay [µs])

Figure 5.13: Summary Statistics of the Delays (bars: Min., 0.5% Q., 25% Q., Mean, Median, 75% Q., 99.5% Q., Max.; horizontal axis: Delay [µs])

6 Advanced Input Modeling

To employ the measured delays in a performance study of the system, the statistical parameters of the data have to be determined. In the input modeling phase, the representation of the real-world data in the model must be determined. Two different approaches are applicable: trace-driven performance evaluation and the use of distribution functions.

In trace-driven modeling, a recorded trace is fed into the model, from which all events are generated in exactly the same order and with the same temporal distance as observed at the real system. While this is extremely useful for checking whether the simulation results are valid and comparable to real-world data, the amount of recorded data is limited in most situations. Therefore, once the trace has been consumed by the model, the only solution is to repeat the process and continue with the start of the trace again. This produces correlated data that are not independent, so care has to be taken when analyzing the results of the performance study. Furthermore, this approach can only be chosen if a real setup of the modeled system is available. This is not always the case, because performance evaluations of several different architectures are often conducted before a setup is built, with the aim of deciding which design alternatives to implement.

The use of distribution functions makes it possible to generate an unlimited number of independent random variates. In its basic form, this approach is only valid if the measured data from which the distribution function is to be determined are independent. Nonetheless, different solutions for dealing with correlated input data will be presented in the next sections. For uncorrelated input data, the first step is to determine the distribution (or probability mass) function of the measured data. Once this function has been found, the empirical function can be used directly in the model or a fitting theoretical function can be determined.
Figure 6.1 shows histograms of the 27 measured delays from the previous chapter. Each of the delays can be used in a performance model. The theoretical representations we utilized were implemented in separate simulation models to validate whether they are able to represent the measured values. This was done by generating several thousand sample points and analyzing them with quantile comparisons and a variety of plots such as histograms, trace plots, scatter diagrams and correlation plots.

In addition to the implementation of a detailed simulation model, Isabel Wagner implemented and improved the input modeling in two theses [87, 88]. Preliminary results of the modeling have been published in [89].

6.1 Traces and Empirical Distributions

As a first step, the behavior of a performance model, especially when building a discrete event simulation, can be evaluated using a trace-driven approach. Once the structure of the model is implemented, the necessary random variates that influence this behavior are taken directly from a recorded trace file. When we built a model of our IP stack, we were able to use the example measurement results from section 5.5 to drive the model directly. For this purpose, trace files of the 27 delays were prepared. For each delay occurring in the model, the next delay from the trace file was used as an input to the model. Once the end of a trace had been reached, the process continued from the beginning of the file. This made it possible to check whether the model structure represented the measured system, for example by comparing the time from the sending of the first SYN segment by the client until the reception of the last ACK segment of each TCP connection by a server in the model to the measured values. Some structural model errors can be detected using this validation process.

This simple approach is only feasible when the model represents the setup of the real system. It obviously cannot be used for configurations for which no laboratory setup exists and therefore no measurements are available.
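The replay of a trace with wrap-around can be sketched as follows. This is an illustrative Python fragment, not the AnyLogic model described in the text; the class name `TraceSource` and the `wraps` counter are assumptions. The counter is included because every restart of the trace reuses data, which matters when the results are analyzed.

```python
class TraceSource:
    """Replay recorded delays in order and wrap around at the end of the trace."""

    def __init__(self, delays):
        if not delays:
            raise ValueError("the trace must contain at least one delay")
        self._delays = list(delays)
        self._position = 0
        self.wraps = 0  # number of times the trace has been restarted

    def next_delay(self):
        """Return the next recorded delay, restarting at the beginning if needed."""
        if self._position == len(self._delays):
            self._position = 0
            self.wraps += 1
        value = self._delays[self._position]
        self._position += 1
        return value
```

One such source per delay (27 in the example measurement) would be queried whenever the corresponding activity starts in the model.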
The trace-driven approach also does not make it easy to change the load that is imposed on the system.

Assuming independence of the measured data, empirical distribution functions from which random values can be generated can be built as the second step in the input modeling process. Since all measured data are available in our case, we can sort the $n$ observations $X_i$, $i \in \{1, \ldots, n\}$, in increasing order with $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$. A continuous piecewise-linear distribution function $F(x)$ can then be defined as

\[
F(x) =
\begin{cases}
0 & \text{if } x < X_{(1)}, \\
\dfrac{i-1}{n-1} + \dfrac{x - X_{(i)}}{(n-1)\,(X_{(i+1)} - X_{(i)})} & \text{if } X_{(i)} \le x < X_{(i+1)} \text{ for } i = 1, 2, \ldots, n-1, \\
1 & \text{if } X_{(n)} \le x.
\end{cases}
\]

Figure 6.1: Histograms of Observed Delays (one panel per delay, Delay 1 to Delay 27; horizontal axis: Delay [µs], vertical axis: Density)

When only grouped data summaries such as histograms are available, a different approach has to be taken to construct an approximate empirical distribution function. More details of both methods can be found in [45].

While empirical distribution functions represent the statistical properties of the measured data to some extent, their application has some limitations. When the data are sorted, any correlation structure that might be present is obviously lost. Furthermore, the mean of the sampled values $\bar{X}(n)$ can differ from the mean of the distribution function $F(x)$ due to the piecewise-linear interpolation. Another limitation becomes visible when looking at the definition of the function: when generating random values, the lowest value that can be generated is the lowest value in the measurement, and the largest generated value is the largest measured value. As this is not always desired, there are approaches that combine an empirical distribution function with a theoretical distribution function for which this is not the case, for example an exponential distribution. For these reasons, empirical distribution functions are useful in early stages of the model design for validation and first experiments, but when the behavior of different configurations of the system under various parameter settings is to be predicted, theoretical distribution functions overcome these limitations.

6.2 Outlier Values

The measured data often contain values that are not caused by system behavior but are either too small or too large because of errors caused by the measurement itself or by other undesired system activity.
To isolate these effects, these values, called outlier values, have to be eliminated from the traces before distribution fitting. In [94], Winkler describes a method to analyze the statistical properties of the data and to classify the values as either valid or outliers. For values distributed according to a normal distribution, he suggests removing any value outside of the interval $[\mu - 4\sigma, \mu + 4\sigma]$, where $\mu$ is the sample mean and $\sigma$ the standard deviation, resulting in a significance level of [...]. For other distributions, the same approach can be applied, but here the sample median is a good estimator for $\mu$, whereas $\sigma$ can be estimated as the median of the absolute deviations of the sampled values, as these estimators are insensitive to the magnitude of extreme values and outliers. Winkler states that in typical cases less than 1% of the measured data are removed using his method. We implemented the algorithm in the statistical computing environment R [71] to automate the process of outlier removal.

6.3 Autocorrelation

Some of the mathematical methods used when fitting a distribution to the recorded data are only valid when the observations are independent. This is especially true for the maximum-likelihood estimation and the chi-square tests that are used for parameter estimation once a family of distributions has been selected. There are different techniques to assess the independence of the values. As an example, correlation plots for the 27 delays are shown in figure 6.2. The sample correlation $\hat{\rho}_j$ of the observations $X_1, X_2, \ldots, X_n$ is defined as

\[
\hat{\rho}_j = \frac{\hat{C}_j}{S^2(n)},
\]

with

\[
\hat{C}_j = \frac{\sum_{i=1}^{n-j} \left[X_i - \bar{X}(n)\right]\left[X_{i+j} - \bar{X}(n)\right]}{n-j},
\]

\[
S^2(n) = \frac{\sum_{i=1}^{n} \left[X_i - \bar{X}(n)\right]^2}{n-1} \quad \text{(sample variance)},
\]

\[
\bar{X}(n) = \frac{\sum_{i=1}^{n} X_i}{n} \quad \text{(sample mean)}.
\]

The correlation is plotted for varying values of the lag $j \in \{1, 2, \ldots, l\}$. This $\hat{\rho}_j$ is an estimate of the true autocorrelation $\rho_j$ of two observations that are $j$ samples apart in time. If all samples were independent, $\rho_j = 0$ for all $j \in \{1, 2, \ldots, n-1\}$. Since the samples are observations of a random variable, the estimator $\hat{\rho}_j$ will not be exactly 0 for all $j$, but a significant difference from 0 indicates dependence of the observations.

Another graphical method to assess the independence of the samples is the scatter diagram. It displays points with coordinates $(X_i, X_{i+1})$ for pairs of successive samples.
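The estimator and the point set of the scatter diagram translate directly into code. The following Python sketch follows the definitions given above term by term; the function names are hypothetical, and it is an illustration rather than the R code used for the plots.

```python
def sample_autocorrelation(values, lag):
    """Estimate the lag-j autocorrelation as C_j / S^2(n)."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < n")
    mean = sum(values) / n
    # Sample variance S^2(n) with divisor n - 1.
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    if variance == 0:
        raise ValueError("constant data have no defined autocorrelation")
    # Sample covariance C_j with divisor n - j.
    covariance = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    ) / (n - lag)
    return covariance / variance


def scatter_pairs(values):
    """Points (X_i, X_{i+1}) of successive samples for a scatter diagram."""
    return list(zip(values[:-1], values[1:]))
```

Evaluating `sample_autocorrelation` for lags 1 to l gives the values shown in a correlation plot.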
In a scatter diagram of independent observations, the points are scattered randomly in the plane. If the observations are dependent, the points tend to be located along a line in the plane. Examples of these plots will appear in the distribution fitting sections.

Figure 6.2: Correlation Plots (lag 500) (one panel per delay; horizontal axis: Lag, vertical axis: ACF)

Autocorrelation is often caused by queuing or buffering in system components. When network packets are processed by a single server with a FIFO queue and the server is busy when a packet arrives, the packet has to wait until the previous packets have been processed. The longer the processing of the previous packet takes, the longer the subsequent packets have to wait. This shows up as a positive autocorrelation. Delays caused by network transmissions, such as delays 9, 11, 13 and 15, show this typical behavior due to buffering in both the switches and the network interfaces of the system.

Another interesting effect can be seen in a correlation plot, especially when looking at smaller lags as depicted in figure 6.3. For the delays 8, 16 and 24, the autocorrelation is very high for lags that are integer multiples of five. With round-robin scheduling over five real servers, every fifth request is handled by the same real server. The effect therefore indicates that even though the hardware and software of all real server nodes are identical, packets sent to or received from different real servers experience different delays, especially in the load generator node. In figure 6.4, a trace plot sorted by the real server node that is involved in the communication shows that these delays are indeed different for real server node 4. We conducted some further experiments, but were not able to find any systematic reason for this behavior.

A critical point is how to deal with the correlation. As mentioned before, some mathematical estimators are not valid for correlated data, but most graphical methods for determining the goodness of a fitted distribution are still applicable. Depending on the model structure, correlation can have an effect on the results of a performance evaluation.
Therefore, either the model itself must induce the correlation (as is the case when buffers and queues are modeled explicitly) or the random variates must be generated so that they exhibit the same correlation as the measured data.

6.4 Standard Theoretical Distributions

Once the outliers are removed, standard theoretical distributions are a good way to represent the data in the model. They offer the advantage that their parametrization can be changed. Thus, they are not only useful to capture the current behavior, but can also be modified to model the system under different workloads. This can often be done by changing a single parameter of the distribution function, such as the mean value, which is often the most important parameter characterizing a distribution.

Figure 6.3: Correlation Plots (lag 40) (one panel per delay; horizontal axis: Lag, vertical axis: ACF)

Figure 6.4: Trace Plots Sorted by Real Server

For example, the mean value of an exponential distribution can be changed by using a different $\lambda$. Other distributions allow even more modifications: the normal distribution allows changing the mean $\mu$ and the variance $\sigma^2$. For Weibull and gamma distributions, the shape of the density function can additionally be changed by modifying a shape parameter $\alpha$. This makes it possible to adapt these distributions to the different situations to be predicted.

However, it is not always possible to fit a standard distribution. Most standard distributions are monomodal; therefore, a good fit cannot be expected for multimodal data like delays 2, 3, 8 and 16.

The distribution fitting tool ExpertFit automates the process of distribution fitting, parameter estimation and goodness-of-fit tests [44]. In a first attempt, we used ExpertFit to fit standard theoretical distributions to all 27 delays. As explained before, the goodness-of-fit tests indicated a bad fit for most of the delays, which is not surprising considering the shapes of the histograms depicted in figure 6.1. Nonetheless, we were able to achieve an acceptable fit for the seven delays shown in table 6.1.

One problem when applying mathematical goodness-of-fit tests is that their results often indicate a bad fit when a high number of input samples is used. In contrast, the accuracy of the fitted distribution becomes higher the more input data are available. For that reason, we used the maximum of 8,000 samples that can be handled by ExpertFit, even though the goodness-of-fit tests show worse results in this case than the graphical methods for distribution comparison indicate.

Table 6.1: Fitted Standard Theoretical Distributions
Delay  Fitted Distribution
1      Uniform
4      Lognormal
6      Log-Logistic
14     Log-Logistic
22     Lognormal
24     Pearson Type V
27     Lognormal

Graphical comparisons of the measured data with the fitted theoretical distribution are shown by way of example in figure 6.5.
The first row shows a trace plot, the histogram, a correlation plot and a scatter diagram of the samples (delay 22), whereas the second row depicts the same plots for the fitted distribution (lognormal). The first two columns of the plot indicate that the range and the general shape of the density function are a good approximation of the measurement. Furthermore, the correlation is negligible in this case, as indicated by columns three and four of this graph.

Figure 6.5: Distribution Comparison for Delay 22 (rows: Measurements, Fit; panels: trace plot, histogram, correlation plot, scatter diagram)

6.5 Multimodal Distributions

Some of the measured data sets exhibit a multimodal distribution that is visible as two or more peaks in the corresponding histogram. A multimodal distribution is often caused by a mixture of data from different monomodal distributions. When fitting a distribution function, all monomodal distribution functions have to be handled separately. Since ExpertFit is unable to deal with multimodal data, we determined initial split points between the distributions visually by looking at the histograms. We were then able to fit a distribution to the individual data separated by these thresholds using ExpertFit.

Once the monomodal distributions are known and their parameters are estimated, the weights of the individual distributions in the multimodal distribution, which is a weighted mixture of them, have to be determined. This can be done by determining the probability mass in each part as the relative frequency of the observations between the split points. Special attention has to be paid to the overlapping areas, as the different monomodal distributions will contribute different amounts of probability mass here. Therefore, the resulting multimodal distribution must be compared to the trace file, and the split point may have to be modified in an iterative process.
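For the bimodal case, the weight estimation and the weighted selection of a component during random variate generation can be sketched as follows. The function names are hypothetical, the component samplers are passed in as plain callables, and the simple counting below does not correct for the overlap of the two components around the split point, so it only provides a starting value for the iterative adjustment.

```python
import random


def mixture_weights(observations, split_point):
    """Relative frequency of the observations below and above the split point."""
    n = len(observations)
    if n == 0:
        raise ValueError("no observations")
    lower = sum(1 for x in observations if x < split_point)
    return lower / n, (n - lower) / n


def sample_bimodal(weight_lower, draw_lower, draw_upper, rng=random):
    """Select one component according to its weight, then draw a sample from it."""
    if rng.random() < weight_lower:
        return draw_lower()
    return draw_upper()
```

The samplers could, for instance, wrap `random.lognormvariate` with parameters taken from a fitting tool; any such parameter values would have to come from the fit and are not given here.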

When the resulting multimodal distribution is used in a performance model, one of the monomodal distributions is selected at random for each sample according to the weights of the distributions, and the sample is then generated from the selected distribution. This procedure generates independent random variates with the right density and weights the contributing components correctly. For this reason, the process can only be applied to independent input data that exhibit an autocorrelation near zero for all lags. A method for correlated samples will be presented in the next section.

All distributions of our example for which we were successful in fitting an uncorrelated multimodal distribution are shown in table 6.2. All of these distributions are composed of two standard theoretical distributions and thus are bimodal.

Table 6.2: Fitted Multimodal Distributions
Delay  Lower Distribution  Upper Distribution  Split Point
2      Log-Logistic        Log-Laplace         [...] µs
3      Log-Logistic        Log-Logistic        [...] µs
8      Inverted Weibull    Log-Logistic        [...] µs
16     Inverted Weibull    Johnson SB          [...] µs

Figure 6.6 shows a comparison of the measured data of delay 3 with a distribution fitted using this method. According to the histogram, the weights and shapes of the monomodal distributions (both log-logistic) were chosen correctly. Additionally, the scatter diagrams indicate that the number of samples in each mode and the independent transitions between the modes were modeled correctly.

6.6 Multimodal Distributions with Phases

In figure 6.1, the shape of the histogram for delay 19 looks clearly bimodal, but figure 6.2 indicates high autocorrelation. The reason for this becomes clear from the trace plot of this delay in figure 5.10: for a certain time, the values come almost exclusively from one of the modes. After this phase, almost only values from the second mode occur in the trace. Other delays also show this behavior. The length of the phases differs, but there seems to be a minimum length for each.

Figure 6.6: Distribution Comparison for Delay 3 (rows: Measurements, Fit; panels: trace plot, histogram, correlation plot, scatter diagram)

The first steps in the distribution fitting are the same as in the independent multimodal case. The trace is split into individual modes and the relative frequency of each mode is determined. In this case, however, the minimum length of each of the phases is also a required parameter.

Once these values are known, bimodal distributions with phases can be modeled as a finite state machine with two states. Samples are generated from the upper mode as long as the state machine is in the upper state. When a number of samples corresponding to the minimum length of this phase has been generated, a state change to the lower state can happen. The probability of this state change is chosen according to the relative frequency of the measured values in each of the modes. When a state change happens, the state machine is in the lower state and samples are generated from the lower mode. Again, a state change can happen according to the relative frequency of the modes once the number of generated samples reaches an integer multiple of the minimum length of this phase.

Figure 6.7 shows a state chart for this generation scheme as it is used in the simulation tool AnyLogic. The minimum phase lengths are identified by utime and ltime here. When this time has passed (i.e. the minimum required number of samples from the corresponding phase has been generated), a state change to the state denoted by state happens. In this state, another state change happens immediately. The direction of this state change, either back to the original state or to the other mode of the distribution, is chosen randomly according to the relative frequencies of the measured data, denoted by ratio.
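The generation scheme can be sketched in a few lines of Python. This is an illustration of the state machine described above and not the AnyLogic state chart itself. The class and parameter names are hypothetical, phase lengths are counted in samples, the generator is assumed to start in the upper state, and `ratio` is interpreted as the probability of entering (or staying in) the upper state at a decision point; the text leaves the direction of the comparison open.

```python
import random


class PhasedBimodalSource:
    """Bimodal sample generator that stays in one mode for at least a minimum phase."""

    def __init__(self, draw_upper, draw_lower, min_upper, min_lower, ratio, rng=None):
        self.draw_upper = draw_upper  # callable returning a sample of the upper mode
        self.draw_lower = draw_lower  # callable returning a sample of the lower mode
        self.min_upper = min_upper    # minimum phase length in samples (utime)
        self.min_lower = min_lower    # minimum phase length in samples (ltime)
        self.ratio = ratio            # assumed: probability of choosing the upper state
        self.rng = rng or random.Random()
        self.state = "upper"          # assumed initial state
        self.remaining = min_upper

    def next_sample(self):
        if self.remaining == 0:
            # Decision point: choose the next phase, possibly the same one again.
            prob = self.rng.random()
            self.state = "upper" if prob < self.ratio else "lower"
            self.remaining = self.min_upper if self.state == "upper" else self.min_lower
        self.remaining -= 1
        if self.state == "upper":
            return self.draw_upper()
        return self.draw_lower()
```

Because a decision is taken only after a complete minimum phase, every run of samples from one mode has a length that is an integer multiple of the corresponding minimum phase length.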
By modeling the distribution in this way, it is ensured that the probability mass is distributed correctly among the modes of the multimodal distribution even when the process of the state changes behaves differently than in the real system.

Figure 6.7: State Chart for Phase Transitions (states: upper, state, lower; transition from upper with trigger utime and action prob = uniform(); transition from lower with trigger ltime and action prob = uniform(); branches prob < ratio and prob >= ratio)

Table 6.3: Fitted Multimodal Distributions with Phases
Delay  Lower Distribution  Upper Distribution  Split Point
10     Bézier              Bézier              [...] µs
11     Lognormal           Log-Logistic        [...] µs
12     Johnson SB          Log-Logistic        [...] µs
18     Bézier              Bézier              [...] µs
19     Pearson Type VI     Log-Logistic        [...] µs
20     Pearson Type V      Inverted Weibull    [...] µs
26     Bézier              Bézier              [...] µs

The approach presented here is applicable to the delays in table 6.3. We were not able to fit standard distributions to all monomodal distributions of the phases. For some of them, we used Bézier curves as described in the next section.

An exemplary comparison with the measured values is depicted in figure 6.8. The lengths and the distribution of the phases of the synthetically generated data agree well with the measurements. The autocorrelation is also captured in the model, as indicated by the correlation plot. The scatter diagram shows one aspect of the samples that is not included in the models: the sporadic generation of values from the other mode of the distribution. This is of minor impact for most delays. For delay 26, where this happens more often, this generation from the other mode has been included in the model. The relative frequencies that are used for transitions in the state machine had to be modified to maintain the right distribution of the probability mass.

Figure 6.8: Distribution Comparison for Delay 19 (rows: Measurements, Fit; panels: trace plot, histogram, correlation plot, scatter diagram)

6.7 Bézier Distributions

When no theoretical distribution function fits the samples, or the part of the samples that belongs to one mode, Bézier distributions are an alternative approach. Classical Bézier curves are often used in computer graphics as an approximation of smooth univariate functions on a bounded interval. They have been adapted for the approximation of distribution functions by Wagner and Wilson [90, 91]. To apply this process to a set of sampled data $X_1, X_2, \ldots, X_n$ that represent a continuous random variable $X$ with a finite range $[a, b]$, a set of control points $\{p_0, p_1, \ldots, p_m\}$ has to be placed, where $p_i = (y_i, z_i)$ for $i \in \{1, 2, \ldots, m-1\}$, $p_0 = (a, 0)$ and $p_m = (b, 1)$. A Bézier distribution function $P(t)$ of degree $m$ is given parametrically by

\[
P(t) = \sum_{i=0}^{m} B_{m,i}(t)\, p_i \quad \text{for } t \in [0, 1],
\]

where the blending function $B_{m,i}(t)$ is the Bernstein polynomial

\[
B_{m,i}(t) =
\begin{cases}
\dfrac{m!}{i!\,(m-i)!}\, t^i (1-t)^{m-i} & \text{for } t \in [0, 1] \text{ and } i \in \{0, 1, \ldots, m\}, \\
0 & \text{otherwise.}
\end{cases}
\]

The resulting Bézier curve passes through the first and the last control point. Setting the control points $p_0$ and $p_m$ as noted above ensures that the resulting function has the value 0 at its lower endpoint $a$ and 1 at the upper endpoint $b$. Wagner and Wilson show in [90] how to create a Bézier function that also fulfills the monotonically nondecreasing property of a distribution function.

Their graphical tool PRIME can be used to fit these Bézier distributions to sets of sampled data. Several automated fitting methods based on optimization of the control point coordinates are available, as well as the possibility to adjust the control points manually. The fitting process results in the coordinates of the control points. PRIME is limited to a maximum degree of the Bézier curve of 30. This is not an issue in our example measurement, but can become problematic, for example, when trying to fit an accurate distribution function to a data set with three or more peaks in the histogram. Figure 6.9 shows the empirical distribution function for the measured values of delay 18 and the fitted Bézier curve as it is used in [88].

Figure 6.9: Screenshot of PRIME

The possibility of using Bézier distribution functions is also proposed in [45]. Law and Kelton mention that these distributions are a good alternative to empirical distribution functions, but have the drawback that they are not included in most performance evaluation tools. This was also true for AnyLogic, the simulation environment we used to model the cluster system. Therefore, Isabel Wagner [88] implemented the random variate generation approach presented in [90]. Here, a sample from a Bézier distribution function that is given as a set of control points is generated using the method of inversion. The first step is to generate a random number $U$ from the uniform distribution on $[0, 1]$. The goal is then to find a value $t_U$ so that

\[
U = \sum_{i=0}^{m} B_{m,i}(t_U)\, z_i,
\]

i.e. to invert the function above numerically. In our implementation, we used a combination of two numerical root-finding algorithms. First, two approximate solutions are obtained using bisection with two runs of low order (4 and 5) [11]. The final solution is calculated using the secant method [11] with the results of the bisection as initial approximations. Using this $t_U$, a random variate $y(t_U)$ from the Bézier distribution can be generated as

\[
y(t_U) = \sum_{i=0}^{m} B_{m,i}(t_U)\, y_i.
\]

As mentioned in the previous section (table 6.3), the modes of the delays 10, 18 and 26 can be represented as Bézier distributions. The multimodal delays are generated from these using the phases approach. Figure 6.10 shows a comparison of synthetically generated random variates with measured data. Again, the scatter diagram shows that the sporadic generation from the other phase has been neglected. The other parameters of the distribution fit well; in particular, the representation of the probability density function shows the applicability of the Bézier approach.
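The inversion procedure can be sketched as follows. This Python fragment is not the implementation from [88]; the function names, the tolerance, the iteration limit and the clamping of the secant iterate to [0, 1] are assumptions, and "two runs of low order (4 and 5)" is interpreted here as two bisection runs with 4 and 5 halving steps whose bracket midpoints serve as starting values for the secant method. The control point coordinates are passed as two lists, `y` for the values and `z` for the cumulative probabilities.

```python
import math
import random


def bernstein(m, i, t):
    """Bernstein polynomial B_{m,i}(t) for t in [0, 1]."""
    return math.comb(m, i) * t ** i * (1.0 - t) ** (m - i)


def bezier_coordinate(coordinates, t):
    """Evaluate sum_i B_{m,i}(t) * c_i for one coordinate of the control points."""
    m = len(coordinates) - 1
    return sum(bernstein(m, i, t) * c for i, c in enumerate(coordinates))


def bisection_estimate(f, steps):
    """Midpoint of the bracket around the root of f on [0, 1] after `steps` halvings."""
    low, high = 0.0, 1.0
    for _ in range(steps):
        mid = 0.5 * (low + high)
        if f(mid) < 0.0:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


def solve_parameter(z, u, tolerance=1e-12, max_iterations=50):
    """Find t_U with sum_i B_{m,i}(t_U) * z_i = u: bisection first, then secant."""
    def f(t):
        return bezier_coordinate(z, t) - u

    t_prev, t_curr = bisection_estimate(f, 4), bisection_estimate(f, 5)
    f_prev, f_curr = f(t_prev), f(t_curr)
    for _ in range(max_iterations):
        if abs(f_curr) < tolerance or f_curr == f_prev:
            break
        t_next = t_curr - f_curr * (t_curr - t_prev) / (f_curr - f_prev)
        t_next = min(1.0, max(0.0, t_next))  # keep the iterate inside [0, 1]
        t_prev, f_prev = t_curr, f_curr
        t_curr, f_curr = t_next, f(t_next)
    return t_curr


def bezier_variate(y, z, u=None, rng=random):
    """Generate one variate by inversion; u can be supplied for reproducibility."""
    if u is None:
        u = rng.random()
    return bezier_coordinate(y, solve_parameter(z, u))
```

For the two control points (a, 0) and (b, 1), the curve is a straight line and the generator reduces to a uniform distribution on [a, b], which is a convenient sanity check.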
6.8 A New Model for Autocorrelated Data

Some of the measured delays feature both a high autocorrelation over relatively large lags and a clear upper and lower bound. This is the case for nearly all channel delays. When looking at the trace plots of these delays, a structure of rising or

falling bands is clearly visible. A reason for this can be found in the buffering in the network interface at the receiving side. To limit the negative effects of frequent interrupt requests, modern network interfaces buffer received frames and issue an interrupt request only once either a reasonable amount of data has been collected in the buffer or no additional frame has been received for a certain time. The higher the bit rate of the medium, the more frames of a fixed size can be received per time unit. Therefore, the buffers are usually larger in Gigabit Ethernet interfaces than they are in Fast Ethernet interfaces. The frames can only be handled by the IP stack of the operating system after the interrupt has been handled by the driver and the content of the frame has been copied over the interconnecting bus. As the timestamping is done in the IP stack, the timestamps for packet reception will be close together for all packets that were transferred to the driver in the same interrupt handler invocation. The timestamps for sending these packets are also generated in the IP stack, but in this case no buffering occurs before they are taken. These effects are visible as bands in the trace plots, and the resulting autocorrelation can affect performance studies of the system.

Figure 6.10: Distribution Comparison for Delay 18 (measurements and fit; panels: delay [µs] over time [s], density over delay [µs], correlation over lag, scatter diagram of the delays)
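How such buffering groups reception timestamps can be illustrated with a minimal Python sketch. The coalescing policy (an interrupt after `max_frames` buffered frames or after `timeout` time units without a new frame), the function name and the arrival times are assumptions for illustration only; real network interfaces implement different and more elaborate policies.

```python
def coalesce(arrivals, max_frames, timeout):
    """Return, for each frame, the time of the interrupt that hands it to the driver.

    arrivals must be sorted frame arrival times at the network interface.
    """
    handed = []
    buffered = []
    for k, t in enumerate(arrivals):
        buffered.append(t)
        nxt = arrivals[k + 1] if k + 1 < len(arrivals) else float("inf")
        if len(buffered) >= max_frames:
            irq = t                      # enough data collected in the buffer
        elif nxt - t > timeout:
            irq = t + timeout            # no further frame within the timeout
        else:
            continue                     # keep buffering
        handed.extend([irq] * len(buffered))
        buffered.clear()
    return handed

# Made-up arrival times (arbitrary time units).
arrivals = [0, 10, 20, 30, 100, 110]
interrupts = coalesce(arrivals, max_frames=3, timeout=50)
extra_delay = [i - a for i, a in zip(interrupts, arrivals)]
```

In this example, the first three frames share one interrupt time, so the buffering delay falls from frame to frame within the group. This is one way in which falling bands can arise in a trace plot.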
To model this behavior of the network interfaces explicitly, it would be necessary to separate the delay components that contribute to the measured delays: the delay in the IP stack of the sender, the transmission delay, the propagation delay, the queuing and processing delays in switches or routers, the time spent in the buffer of the receiving network interface, the interrupt latency of the receiving node and the time in the IP stack of the receiver until the timestamping. Some of these times can be calculated analytically from the physical characteristics of the channel. For example, the propagation delay depends on the length of the interconnecting

medium and its velocity factor, whereas the transmission delay is a function of the frame length and the bit rate of the interconnection. But other factors like the queuing delay are statistically distributed and are influenced by other activity in the system. Measuring these delays individually is not a practical alternative either, as this would require an enormous amount of hardware at different places along the transmission path. Therefore, the most promising solution is to generate random variates for the overall delays that feature the same statistical properties as the measured values.

Because of the high autocorrelation, we do not generate samples $d_i$ of the delays directly, but generate random variates for the difference between the current and the next delay value, $\delta_i = d_{i+1} - d_i$. The next delay sample can then be calculated as $d_{i+1} = d_i + \delta_i$ from the current delay $d_i$ and the sampled $\delta_i$. In a first step, we determined the histogram $H_o$ of all deltas. Figure 6.11 shows this overall histogram.

Figure 6.11: Histogram $H_o$ of the Deltas for Delay 5 (frequency over Delta 5 [ns])

But as the delays have an upper and a lower bound, the values of delta clearly cannot be independent of the current delay value. In figure 6.12, these bounds are plotted for the measured values of delay 5 as dashed horizontal lines. Due to the upper and lower bounds $d_{\max}$ and $d_{\min}$ of the delays, the deltas for a given delay $d_i$ can only be sampled from the limited range $d_{\min} - d_i \le \delta_i \le d_{\max} - d_i$. This valid range can be calculated for all values $d_i$.
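The definitions of the deltas and of the valid range translate directly into code. The following Python sketch uses a made-up trace and assumes that the bounds are taken as the minimum and maximum of the trace; the function names are hypothetical.

```python
def deltas(delays):
    """delta_i = d_{i+1} - d_i for consecutive delay samples."""
    return [nxt - cur for cur, nxt in zip(delays, delays[1:])]

def valid_range(d_i, d_min, d_max):
    """Interval of deltas that keeps d_i + delta inside [d_min, d_max]."""
    return d_min - d_i, d_max - d_i

# Made-up delay trace (arbitrary units), not measured data.
trace = [70, 95, 120, 150, 185, 66, 90]
d_min, d_max = min(trace), max(trace)   # assumption: bounds taken from the trace

# Every observed pair (d_i, delta_i) lies inside the valid range.
inside = [
    valid_range(d_i, d_min, d_max)[0] <= delta <= valid_range(d_i, d_min, d_max)[1]
    for d_i, delta in zip(trace, deltas(trace))
]
```

The long rise followed by one large negative delta in the example trace mimics a rising band that jumps back towards the lower bound.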

Figure 6.12: Trace Plot of Delay 5 (Delay 5 [ns] over measured delay number)

In figure 6.13 we plotted the observed delta values over the values of the current delay for the measured trace of delay 5. We also determined the valid range for $\delta_i$ for all delay values. This resulted in a parallelogram-shaped area in the plot. All points $(d_i, \delta_i)$ are located inside this parallelogram. Since the deltas depend on the current delay, it might not be correct to determine the overall histogram $H_o$ and neglect the dependency on the current delay. To see whether $\delta_i$ shows further dependencies inside the valid area, we constructed a number of histograms that each include only the values of $\delta_i$ for a certain range of $d_i$. Once these histograms have been obtained, all of them can be combined into one three-dimensional histogram that shows the relative frequency of the occurring $\delta_i$ for ranges of $d_i$. As one can see from figure 6.14, the height of the surface is almost the same along the delay axis inside the valid area. This shows that the deltas are to a large extent independent of the current delay inside the valid area. For this reason, using the overall histogram $H_o$ of all $\delta_i$, regardless of $d_i$, is justified. But since only values inside the valid area are used to construct the histogram $H_o$, fewer values contribute to the bins for extreme negative and positive deltas than

to the bins around zero. Therefore, an additional histogram $H_w$ is built, in which the bins of the histogram $H_o$ are weighted with a weighting factor. The weighting factor is chosen so that the resulting histogram $H_w$ is the histogram that would be generated if delta values were observed for all delays $d_{\min} \le d_i \le d_{\max}$. That means that we extrapolate the distribution of the $\delta_i$ outside of the valid area from the values within.

Figure 6.13: Delta over the Values of Delay 5 (Delta 5 [ns] over Delay 5 [ns])

The factor $w_k$ for bin $k$ can be calculated as the ratio of the area $A_k$, which can contain delta values that fall into the respective bin in the extrapolated histogram, to the area $C_k$, the part of the bin that is covered by the valid area: $w_k = A_k / C_k$. The area $A_k$ of the delta values that potentially contribute to the bin of $H_w$ is delimited on the vertical axis by the borders of bin $k$ and on the horizontal axis by $d_{\min}$ and $d_{\max}$. The area $C_k$ is the part of the area $A_k$ that overlaps with the parallelogram of the valid area and thus the area that contains the delta values that contribute to the respective bin in $H_o$. Figure 6.15 illustrates these areas for an exemplary bin of the values for delay 5. For an equidistant histogram with bin width $b$ the area $A_k$ is constant: $A_k = b\,(d_{\max} - d_{\min}) \;\forall k$. The resulting weighting factors for a histogram of the

deltas for delay 5 with 40 bins are shown in figure 6.16. If $o_k$ denotes the number of observations in bin $k$ of $H_o$, then the number of observations in bin $k$ of $H_w$ can be calculated as $o_k w_k = o_k A_k / C_k$. The resulting histogram $H_w$ is the histogram that would be generated if delta values were observed for all delays $d_{\min} \le d_i \le d_{\max}$, i.e. even outside of the valid area. The effect of the weighting process is shown in figure 6.17, where the overall histogram $H_o$ is overlaid with the weighted histogram $H_w$. Since $w_k > 1 \;\forall k$, the number of potential observations in each bin of $H_w$ is larger than the number of observations in the respective bin of $H_o$. This is the effect of extrapolating from the valid area to the whole area.

Figure 6.14: 3D Histogram of Delta 5 (frequency over delay and delta)

The histogram $H_w$ is used to construct an empirical distribution function for grouped data [45]. From this distribution function, values for $\delta_i$ are sampled. These are used to calculate the next delay from the previous one as $d_{i+1} = d_i + \delta_i$. But since there is a constraint on the allowed values for $\delta_i$ depending on $d_i$, only a certain part of the empirical distribution function is used for each sample. The

part is limited by the valid area and chosen so that $\delta_i \in [d_{\min} - d_i;\ d_{\max} - d_i]$ for the current delay $d_i$. In our example, the bounds for delay 5 are $d_{\min}$ = 64, ns and $d_{\max}$ = 192, ns. When, e.g., the delay $d_j$ has reached 190, ns, $\delta_j$ is sampled from the range [$d_{\min} - d_j$ = -125, ns; $d_{\max} - d_j$ = 2, ns]. Due to the weighting, a considerable amount of probability mass is found at the extreme negative end of this range, and there is a high probability that the delay jumps from a high $d_j$ to a low $d_{j+1} = d_j + \delta_j$ by sampling a strongly negative $\delta_j$.

Figure 6.15: Weighting Areas (areas $A_k$ and $C_k$; Delta 5 [ns] over Delay 5 [ns])

Figure 6.18 compares the measured data with values generated using this method. The graphs show a close match of all relevant characteristics. We implemented this method of distribution fitting in R. The functions produce Java code that can be used in the discrete event simulation tool AnyLogic to generate random variates according to our new procedure.
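The complete procedure (weighting the histogram, building the distribution function for grouped data, and sampling each delta only from the valid range) can be sketched in Python as follows. This is an illustrative sketch, not the R or Java code mentioned above. The function names, bin edges, counts and bounds are made up, and the closed form for $C_k$ assumes that the valid area at a given delta $\delta$ has the width $(d_{\max} - d_{\min}) - |\delta|$ and that all bin edges lie within $\pm(d_{\max} - d_{\min})$.

```python
import bisect
import random

def weighting_factors(edges, d_min, d_max):
    """w_k = A_k / C_k for every bin [edges[k], edges[k+1]] of H_o."""
    r = d_max - d_min

    def covered(x):
        # Antiderivative of (r - |x|), the width of the valid area at delta x.
        return r * x - 0.5 * x * abs(x)

    return [(hi - lo) * r / (covered(hi) - covered(lo))
            for lo, hi in zip(edges, edges[1:])]

def cumulative(counts):
    """Cumulative relative frequencies at the bin edges (from 0 to 1)."""
    total = float(sum(counts))
    cum, run = [0.0], 0.0
    for c in counts:
        run += c
        cum.append(run / total)
    return cum

def cdf(edges, cum, x):
    """Piecewise-linear distribution function for grouped data."""
    if x <= edges[0]:
        return 0.0
    if x >= edges[-1]:
        return 1.0
    k = bisect.bisect_right(edges, x) - 1
    frac = (x - edges[k]) / (edges[k + 1] - edges[k])
    return cum[k] + frac * (cum[k + 1] - cum[k])

def inv_cdf(edges, cum, u):
    """Inverse of cdf(); returns a bin edge if the matching bin is empty."""
    k = min(max(bisect.bisect_right(cum, u) - 1, 0), len(edges) - 2)
    if cum[k + 1] == cum[k]:
        return edges[k]
    frac = (u - cum[k]) / (cum[k + 1] - cum[k])
    return edges[k] + frac * (edges[k + 1] - edges[k])

def next_delay(d_i, edges, cum, d_min, d_max, rng=random):
    """Sample delta_i from the part of the distribution inside the valid range."""
    lo, hi = d_min - d_i, d_max - d_i
    u = rng.uniform(cdf(edges, cum, lo), cdf(edges, cum, hi))
    delta = inv_cdf(edges, cum, u)
    # Clamp to guard against rounding at the borders of the valid range.
    return min(max(d_i + delta, d_min), d_max)

# Made-up example data (arbitrary units), not the measured delay 5.
edges = [-120.0, -60.0, 0.0, 60.0, 120.0]     # bin edges of H_o
observed = [5, 40, 40, 15]                    # counts o_k of H_o
d_min, d_max = 60.0, 180.0

w = weighting_factors(edges, d_min, d_max)
cum = cumulative([o * wk for o, wk in zip(observed, w)])   # built from H_w

rng = random.Random(42)
d, trace = 100.0, []
for _ in range(1000):
    d = next_delay(d, edges, cum, d_min, d_max, rng)
    trace.append(d)
```

Restricting the uniform random number to the interval between the distribution function values at the two borders of the valid range is one way to use only the permitted part of the distribution function; other realizations, such as rejection sampling, would also be conceivable.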

Figure 6.16: Weighting Factors (histogram weighting factor for 40 bins, plotted over the bin midpoint)

Figure 6.17: Original and Weighted Histogram for Delta 5 (frequency of $H_o$ and $H_w$ over Delta 5)

Figure 6.18: Distribution Comparison for Delay 5 (measurement and fit; panels: Delay 5 [µs] over delay number, density over delay [µs], correlation over lag, scatter diagram of the delays)


7 Simulation Model

When building a performance model of a complex system like our web cluster, the modeling formalism and the level of detail have to be chosen according to the problem to be solved. Since one goal of this study was to gain insight into the internal behavior of the system, we chose to implement a rather detailed model. The measurement process was thus also implemented to reflect this level of detail. During the input modeling process it became clear that the probability distributions involved do not allow for the application of analytical methods without great simplifications. Trying to implement the distributions as phase-type distributions would certainly lead to the problem of state-space explosion due to the inherent parallelism of the system. Discrete event simulation therefore appeared to be a sensible method to implement the model.

Earlier modeling approaches have been conducted by students of our Simulation and Modeling II class. In their project work, they used a much simpler form of input modeling than the one presented in the previous chapter. Their simulation model was implemented in the process-oriented discrete event simulation environment AutoMod. The model structure is relatively simple, and most components used resemble either single or infinite server queues. Nevertheless, this type of model already gives essential hints for the dimensioning of systems during the planning phase, and the students published their results in the journal Simulation Modelling Practice and Theory [101].

The modeling tool AnyLogic [98] has been used to build the current detailed simulation model of the web cluster. It has been developed in joint work with Isabel Wagner [87, 88] and the resulting model has been published in [89]. AnyLogic is a simulation tool that supports discrete event and continuous simulation. The main formalisms are UML-based. It does not support standard UML with profiles, but provides its own real-time extensions to standard notations such as state charts.
It allows seamless integration of Java code in the models. Simulated entities are represented by Active Objects. The Active Objects can have internal behavior that can be

specified using state charts and Java code. They can communicate using ports. Ports can contain a FIFO queue and can be interconnected to transfer user-defined messages. These messages can be arbitrary Java objects. The Active Objects can be hierarchically structured, so that an Active Object can encapsulate other Active Objects. The objects can also have a multiplicity. In the AnyLogic world, their instances are called replicated objects. Newer versions of the tool are integrated into the Eclipse framework and can be executed on a number of different operating systems like Windows, MacOS and Linux. The executable models are compiled Java bytecode that can be exported as a stand-alone Java applet. Newer versions of the tool also include formalisms that support modeling paradigms beyond the state-chart-based formalism, such as process flow simulation, agent-based simulation and system dynamics. Action charts are a new way to visualize the decision logic and to specify the flow of control in the Active Objects. In addition to all these formalisms, the user is supported by a number of pre-defined objects provided in libraries to speed up and simplify the process of model creation. To obtain an efficient model, the user must be aware that all models are transformed to Java code and should take the specific characteristics of Java, such as garbage collection, into account so that the models can be executed and evaluated at high speed. For example, it is advisable not to create too many objects that are discarded in quick succession, as this requires frequent invocations of the garbage collector and slows down the execution of the model considerably.

7.1 Model Structure

The global structure, in AnyLogic referred to as the Root Object, is composed of five building blocks that represent entities of the setup as it is used in the web cluster laboratory. The structure of the complete model is shown in figure 7.1.
The HTTP requests are generated in the Client objects. The requests are encapsulated as TCP_package objects that represent TCP segments. These objects are transmitted over ports to the Active Object Channel1 that models the network channel between the load generators and the load balancer. The Load Balancer is the next object. It distributes the incoming segments among a number of Server instances that are connected via the second network element Channel2. The modeled real server nodes process the requests and send TCP segments with reply data back to the client through the network channels and the load balancing node. The

model contains a configurable number of server nodes, as indicated by the stacked graphics in figure 7.1. Since the limitations of a client object should not limit the performance of the complete system, our model contains a separate client object for each HTTP transaction. These Active Objects are created dynamically. As the processing of different HTTP transactions overlaps, there is usually more than one client object present. Please note that this is just a brief sketch of the conceptual model. The implemented model is more complicated as it includes TCP dynamics as well as hardware and operating system aspects. For example, TCP requires a connection setup using a three-way handshake before any data can be sent. The details of the various parts involved are presented in the following sections.

Figure 7.1: Conceptual Model

Variable parameters of the simulation are the distributions for the individual delays as shown in chapter 6, the arrival rate of client requests, the sizes of the requested objects, the number of real servers and the load balancing strategy to be employed. Simulation output data includes the individual delays in different elements of the cluster, the total delay and summary statistics like utilization, throughput and mean queue length for the network channels, the load balancer and each of the server processors.

7.2 TCP

The model for TCP is a central aspect of our simulation, since all message transmissions are triggered by the TCP protocol. An Active Object TCP is present in the endpoints of a TCP connection, in our case the client and server nodes. It implements the functionality of the TCP/IP stack of the operating system. For this purpose, it has interfaces for communication with a modeled application and with the network. The interface to the network allows TCP segments to be sent to connected entities for processing. The working principles of TCP are specified in

several Requests for Comments (RFCs) issued by the Internet Engineering Task Force (IETF). The model implements the most important aspects of TCP from the following RFCs.

RFC 793

The basic TCP dynamics have been specified in RFC 793 [84]. It defines the interface of the TCP layer to user mode applications. For that purpose, a number of commands are listed that must be implemented by the stack. The commands open, send, receive, close and abort have been modeled as individual ports of the Active Object over which an application can communicate. Since the status command is not useful in our simulation, it has not been implemented.

Figure 7.2: TCP

According to the specification, a node can use a port named open to open a connection actively or passively, depending on a flag. When a connection is opened actively, the node sends a TCP segment with the SYN flag set to the network to initiate a handshake for a new TCP connection, whereas a passive open puts the node into a state in which it waits for incoming connections, so that other systems can actively connect to it. The close port closes the TCP connection by sending a FIN segment but, as the standard requires, still allows outstanding data to be sent and data to be retransmitted if needed. The abort port allows a connection to be terminated by sending an RST

segment. The receive and send ports are used to transfer data to and from the application, while the error port is used to inform the application about TCP errors. The rcv_flag is used to notify the application when data has arrived. Figure 7.2 shows the Active Object TCP. The ports are displayed as squares on the border of the object. Ports with queues contain a dot in the square. The variables are depicted as circles, state charts are represented as symbols that show two states with transitions, and timers are drawn as a clock with a bell. Embedded Active Objects are displayed as boxes. As the illustration shows, the TCP object has two additional ports. The port total_delay is used to collect statistical information about the time spent in the stack. The port named packet is used for the connection to the network. TCP in the transport layer is the lowest layer that is modeled explicitly. Since the underlying layers do not change the behavior we are interested in apart from adding delays, all lower layers have been merged into what we call the network.

Figure 7.3: Model of a TCP Segment

Segments sent to the network are implemented as TCP_package Active Objects. As shown in figure 7.3, these objects contain the payload and variables to represent the header fields as specified in RFC 793. For technical reasons, an additional timer named delay has been included in the packet. This architecture simplifies the delay handling in the simulation. Two additional variables in this object are used for the purpose of collecting timing statistics. The protocol state machine from the RFC has been directly implemented in the model as a state chart named receive_packet. The states of the RFC are modeled as super-states. The actual processing of packets is done in the internal transitions. Whenever a transition into another super-state is required, a state change variable is set to contain the new state. This variable triggers transitions between super-states.
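The pattern of processing a segment inside the current super-state and switching super-states through a state change variable can be sketched in Python as follows. This is a heavily simplified illustration and not the AnyLogic state chart of the model. It covers only the three-way handshake of RFC 793, omits sequence number checks, retransmission and connection teardown, and all class, field and method names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto

class State(Enum):
    CLOSED = auto()
    LISTEN = auto()
    SYN_SENT = auto()
    SYN_RECEIVED = auto()
    ESTABLISHED = auto()

@dataclass
class Segment:
    """Stand-in for a TCP_package object: header fields, payload and delay."""
    seq: int = 0
    ack: int = 0
    syn: bool = False
    ack_flag: bool = False
    fin: bool = False
    rst: bool = False
    payload: bytes = b""
    delay: float = 0.0           # simulated network delay carried by the segment

class Tcp:
    def __init__(self):
        self.state = State.CLOSED
        self.state_change = None  # set during processing, consumed afterwards
        self.sent = []            # segments handed to the network

    def open(self, active):
        """Active open sends a SYN; passive open waits for incoming connections."""
        if active:
            self.sent.append(Segment(syn=True))
            self.state = State.SYN_SENT
        else:
            self.state = State.LISTEN

    def receive_packet(self, seg):
        # Processing inside the current super-state ("internal transition").
        if self.state is State.LISTEN and seg.syn:
            self.sent.append(Segment(syn=True, ack_flag=True))
            self.state_change = State.SYN_RECEIVED
        elif self.state is State.SYN_SENT and seg.syn and seg.ack_flag:
            self.sent.append(Segment(ack_flag=True))
            self.state_change = State.ESTABLISHED
        elif self.state is State.SYN_RECEIVED and seg.ack_flag:
            self.state_change = State.ESTABLISHED
        # The state change variable triggers the transition between super-states.
        if self.state_change is not None:
            self.state, self.state_change = self.state_change, None

# Three-way handshake between a client and a server endpoint.
client, server = Tcp(), Tcp()
server.open(active=False)
client.open(active=True)
server.receive_packet(client.sent.pop())   # SYN
client.receive_packet(server.sent.pop())   # SYN + ACK
server.receive_packet(client.sent.pop())   # ACK
```

Separating the segment processing from the actual change of the super-state keeps the transition logic in one place, regardless of which branch requested the change.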
This implementation allowed us to reuse code fragments and