Control 2004, University of Bath, UK, September 2004



IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS

M A Hossain and M O Tokhi

Department of Computing, The University of Bradford, UK
Department of Automatic Control and Systems Engineering, The University of Sheffield, UK

INTRODUCTION

The concept of concurrent multithreading in multiprocessor computing is currently a subject of widespread interest to scientists and professional software developers. Concurrent multithreading in a multiprocessor domain is a growing technology with the potential to enable the development of much higher performance applications than traditional methods of software development for uniprocessor or multiprocessor environments. Threads reduce overhead by sharing fundamental parts of a process, for instance code (text), data, stack, file I/O, etc. Because these parts are shared, switching between threads can happen much more frequently and cheaply than switching between processes. Sharing information between threads is not difficult; however, sharing in the presence of dependencies between threads causes degradation of overall computing performance. A single-thread program possesses a single execution thread, or uses a library not designed for a multithreaded environment, whereas a multithread program uses multiple execution threads, or a library that allows concurrent access from multiple threads. Concurrent multithreading separates a process into many execution threads, each of which runs independently. Typically, applications that express their concurrency requirements with threads need not take into account the number of available processors; the performance of the application improves transparently with additional processors. Numerical algorithms and applications with a high degree of parallelism, such as matrix multiplication, can run faster when implemented with threads on a multiprocessor, Gray (1), Lo et al (2).

Data dependency is one of the key issues in multithreading for real-time high performance computing. During implementation of an algorithm, a data dependency between two blocks or two statements requires memory access time. In practice, increasing dependencies implies increasing access time, or more inter-process communication in the case of concurrent multithread implementation in a multiprocessor domain. This, accordingly, degrades real-time performance. Thus, it is essential to study and analyse the data dependencies of an algorithm intended for real-time concurrent thread implementation. Such a study must address the essential questions of how the block or statement dependencies can be reduced, and what the impact of data dependencies on real-time inter-process communication will be.

Detecting the scope for multithreading in an application involves finding sets of computations that can be performed simultaneously. The approach to parallel multithreading is thus based on the study of data dependencies. The presence of a dependence between two computations implies that they cannot be performed in parallel; in general, the fewer the dependencies, the greater the parallelism, Moldovan (3). Many algorithms have regular data dependencies, that is, dependencies that repeat over the entire set of computations in the algorithm. For such algorithms, the dependencies can be described concisely in mathematical form and manipulated easily. However, there are algorithms whose dependencies vary from one computation to another, and these algorithms are more difficult to analyse.
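The distinction between computations that can and cannot be multithreaded is illustrated by the following minimal C++ sketch (an illustration of the general principle, not code from this study):

    #include <cstddef>
    #include <vector>

    // Independent computations: each a[i] depends only on b[i], so the
    // iterations can safely be divided between concurrent threads.
    void independent(std::vector<double>& a, const std::vector<double>& b) {
      for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = 2.0 * b[i];
    }

    // Dependent computations: a[i] needs the value of a[i-1] computed in
    // the previous iteration, so the iterations cannot simply be divided
    // between threads without synchronisation.
    void dependent(std::vector<double>& a) {
      for (std::size_t i = 1; i < a.size(); ++i)
        a[i] += a[i - 1];
    }

The first loop is the regular, dependency-free case in which performance can improve transparently with additional processors; the second exhibits a recurrence, the kind of dependence that forces sequential execution or explicit inter-thread communication.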
When two or more algorithms have similar dependencies, they exhibit similar parallel properties, Hossain et al (11). The performance of the synchronization mechanism of a multiprocessor determines the granularity of parallelism that can be exploited on that machine. Synchronization on a multiprocessor carries a high cost due to the hardware levels at which synchronization and communication must occur, Tullsen et al (5). Therefore, the study of concurrent multithreading also includes inter-process communication, the granularity of the algorithm and of the hardware, and the regularity of the algorithm. Hardware granularity can be defined as the ratio of the computational performance to the communication performance of each processor within the architecture. Similarly, task granularity can be defined as the ratio of the computational demand to the communication demand of the task, Nocetti and Fleming (6). More precisely,

    Task granularity = R / C

where R is the run-time length of a task and C is the communication overhead. Typically, a high compute/communication ratio is desirable. The concept of task granularity can also be viewed in terms of compute time per task: when this is large, it is a coarse-grain task implementation; when it is small, it is a fine-grain task implementation. Although large-grain tasks may ignore potential parallelism, partitioning a problem into the finest possible granularity does not necessarily lead to the fastest solution, as maximum parallelism also has maximum overhead, particularly due to increased communication requirements.
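As a worked illustration with invented numbers (not measurements from this study): a subtask that computes for R = 8 ms per cycle but spends C = 2 ms communicating has task granularity R/C = 8/2 = 4, a fairly coarse grain. Splitting the same work into eight subtasks that each compute for 1 ms while still communicating for 2 ms drops the ratio to R/C = 0.5, at which point communication dominates useful computation.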

Therefore, when partitioning the algorithm into subtasks and distributing these across the processing elements, it is essential to choose an algorithm granularity that balances useful parallel computation against communication and other overheads, Tokhi and Hossain (7), Liau and Prasanna (8).

In this paper, an active vibration control algorithm for a flexible beam system is employed to study and explore real-time concurrent multithread implementation issues in a high performance multiprocessor domain. A dual-processor machine comprising two Intel Pentium III processors (1 GHz) is used as the computing domain to demonstrate the critical issues of concurrent multithreading in real-time computing. The investigation also addresses synchronization, task granularity, load balancing and the inter-process communication problems caused by data and control dependencies in the algorithm. Finally, the results of the implementations are compared on the basis of real-time performance, to explore system design incorporating multithreading techniques in a multiprocessor domain for real-time control.

ACTIVE VIBRATION CONTROL ALGORITHM

Consider a cantilever beam system with a force U(x,t) applied at a distance x from its fixed (clamped) end at time t. This results in a deflection y(x,t) of the beam from its stationary position at the point where the force is applied. In this manner, the governing dynamic equation of the beam is given by

    μ² ∂⁴y(x,t)/∂x⁴ + ∂²y(x,t)/∂t² = (1/m) U(x,t)        (1)

where μ is a beam constant and m is the mass of the beam. Discretising the beam in time and length using central finite difference (FD) methods, a discrete approximation to equation (1) can be obtained as, Virk and Kourmoulis (9),

    Y_{k+1} = -Y_{k-1} - λ²SY_k + ((Δt)²/m) U(x,t)        (2)

where λ² = μ²(Δt)²/(Δx)⁴, with Δt and Δx representing the step sizes in time and along the beam respectively, S is a pentadiagonal matrix (the so-called stiffness matrix of the beam), and Y_i (i = k+1, k, k-1) is an (n × 1) matrix representing the deflection of the end of sections 1 to n of the beam at time step i (the beam being divided into n sections). Equation (2) is the required relation for the simulation algorithm, and can readily be implemented on a computing domain.

A schematic diagram of an AVC structure is shown in Figure 1. A detection sensor detects the unwanted (primary) disturbance. This is processed by a controller to generate a cancelling (secondary, control) signal so as to achieve cancellation at the observation point.

Figure 1: Active vibration control structure (primary source, detector, controller C, secondary source, observed signal)

The objective in Figure 1 is to achieve total (optimum) vibration suppression at the observation point. Synthesising the controller on the basis of this objective yields, Tokhi and Hossain (10),

    C = Q₀ / (Q₀ - Q₁)        (3)

where Q₀ and Q₁ represent the equivalent transfer functions of the system (with input at the detector and output at the observer) when the secondary source is off and on, respectively.

To investigate the nature and real-time processing requirements of the AVC algorithm, it is divided into two parts, namely control and identification. The control part is tightly coupled with the simulation algorithm, and both will be described in an integral manner as the control algorithm. The simulation algorithm will also be explored as a distinct algorithm.
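As a concrete rendering of equation (2), the simulation update can be written as a nested loop over time steps and interior beam segments. This is a minimal sketch, not the authors' code: the external force term is omitted, the boundary conditions are reduced to zero-valued ghost cells, and b is assumed to be the diagonal stencil term 6 - 2/λ², which absorbs the 2y_k contribution of the central time difference into the stiffness matrix S:

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Advance the beam deflections through 'steps' time steps of eq. (2).
    void simulate(std::vector<double>& prev, std::vector<double>& curr,
                  std::vector<double>& next, double lambda2, int steps) {
      const double b = 6.0 - 2.0 / lambda2;       // assumed diagonal term of S
      const std::size_t n = curr.size();
      for (int k = 0; k < steps; ++k) {
        for (std::size_t i = 2; i + 2 < n; ++i)   // interior segments only
          next[i] = -prev[i] - lambda2 * (curr[i-2] - 4.0*curr[i-1]
                    + b*curr[i] - 4.0*curr[i+1] + curr[i+2]);
        prev.swap(curr);                          // y(k-1) <- y(k)
        curr.swap(next);                          // y(k)   <- y(k+1)
      }
    }

    int main() {
      // 20 segments plus two ghost cells at each end (sizes illustrative);
      // lambda^2 is kept below 0.25, the stability limit of this simplified
      // scheme (the paper's value is 0.3629).
      std::vector<double> prev(24), curr(24), next(24);
      curr[12] = 1e-3;                            // small initial disturbance
      simulate(prev, curr, next, 0.2, 9000);
      std::printf("mid-beam deflection: %g\n", curr[12]);
    }

Every evaluation of next[i] reads five entries of curr and one of prev: exactly the six-value dependency noted below for segments 8 and 6.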
Both of these algorithms are predominantly matrix based. The identification algorithm consists of parameter estimation of the models Q₀ and Q₁ and calculation of the required controller parameters according to equation (3). However, the nature of the identification algorithm is completely different from that of the simulation and control algorithms, Tokhi and Hossain (7). Thus, for reasons of consistency, only the simulation and control algorithms are considered in this investigation. The simulation algorithm forms a major part of the control algorithm; of the two algorithms, therefore, the simulation algorithm is considered here to address the issue of data dependency. Consider the simulation algorithm in equation (2). This can be rewritten, for computing the deflection of segments 8 and 6, as follows, assuming no external force is applied at these points:

    y[8][8] = -y[8][6] - (lambda2)*(y[6][7] - 4*y[7][7] + b*y[8][7] - 4*y[9][7] + y[10][7]);
    y[6][6] = -y[6][4] - (lambda2)*(y[4][5] - 4*y[5][5] + b*y[6][5] - 4*y[7][5] + y[8][5]);

where lambda2 denotes λ² and b is the corresponding diagonal term of the stiffness matrix. It is noted that the computation of the deflection of a particular segment depends on six other deflection values. These heavy dependencies can be a major cause of performance degradation in real-time sequential computing, due to memory access time. Likewise, these dependencies can cause significant performance degradation in real-time concurrent multithreading, due to inter-process communication overheads, Hossain et al (11).

HARDWARE AND SOFTWARE RESOURCES

To demonstrate the issues of real-time multithread implementation in a multiprocessing environment, a computing domain with dual Pentium III processors was used. The domain comprises dual PIII (1 GHz) processors, RDRAM (800 MHz clock speed) as main shared memory, and a 256 KB L2 cache per processor. It is a Dell multiprocessing machine (Brimstone) running the Linux operating system. The g++ compiler (gcc version 2.96) was used to develop the source code for the real-time implementation.

IMPLEMENTATION AND RESULTS

To investigate parallel multithread implementation, the flexible beam simulation algorithm was considered for a cantilever beam of length L = 0.635 m, mass m = 0.0375 kg and λ² = 0.3629. For adequate accuracy in representing the first few (dominant) modes of vibration, the beam was divided into a sufficiently large number of segments.

Figure 2 shows the execution time for a single thread and for two threads with synchronization. It is noticed that, due to synchronization, inter-process communication overheads and other operating system overheads, a significant level of performance degradation is observed for two threads as compared to a single thread. Ideally, the execution time of the algorithm with two threads on the dual-processor domain would be half that of a single thread. In fact, the performance achieved with two threads is much worse even than that of a single thread, and the gap continues to grow as the number of iterations increases.

Figure 2: Execution time (sec) for a single thread and two threads with synchronization

Figure 3 shows the execution time for a single thread and for two threads without synchronization (treating the threads as if there were no data dependencies between them) for a similar computation. The basic objective of computing without synchronization is to expose the overhead of the synchronization itself. It is evident that the speedup for two threads in the dual-processor environment is then double that of a single thread, and this remains almost consistent as the number of iterations increases. Figure 4 shows the execution time for two threads with and without synchronization. It is further observed that the impact of synchronization for inter-process resource sharing due to data dependencies is extremely high. This, in turn, causes a significant level of performance degradation for two-thread execution in the dual-processor domain when implementing the simulation algorithm. Table 1 shows the multithread execution time with and without synchronization.
It is noticed that, without synchronization, a significant performance gain is achieved for two threads as compared to single-thread execution. However, this improvement is not sustained as the number of threads is increased further. This could be due to the limited number of processors in the domain, and demonstrates that for a domain comprising two processors, two threads may be the best mapping, rather than any finer level of granularity. In addition, operating system overheads and scheduling and mapping problems can cause further performance degradation as the number of threads increases.
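The two-thread implementation measured above can be sketched as follows (again illustrative, not the authors' code; names and sizes are assumptions). Each thread owns half of the beam, but its boundary segments read deflections owned by the other thread, so neither thread may begin time step k+1 until both have completed step k. The per-step barrier below is exactly where the synchronisation and communication overhead of Table 1 accumulates:

    // build: g++ -std=c++20 -pthread beam_threads.cc
    #include <barrier>
    #include <cstdio>
    #include <thread>
    #include <vector>

    constexpr int    NSEG    = 20;    // beam segments (assumed)
    constexpr int    NSTEPS  = 9000;  // time-step iterations
    constexpr double LAMBDA2 = 0.2;   // kept below 0.25 for stability of this
                                      // simplified scheme (paper: 0.3629)
    constexpr double B = 6.0 - 2.0 / LAMBDA2;  // assumed diagonal stencil term

    // Deflections at time steps k-1, k and k+1, with two ghost cells per end
    // standing in for the real boundary conditions (simplified to zero here).
    std::vector<double> prev_(NSEG + 4), curr_(NSEG + 4), next_(NSEG + 4);

    // When both threads reach the barrier, one buffer rotation advances time.
    std::barrier step_barrier(2, []() noexcept {
      prev_.swap(curr_);
      curr_.swap(next_);
    });

    void worker(int lo, int hi) {              // this thread owns [lo, hi)
      for (int k = 0; k < NSTEPS; ++k) {
        for (int i = lo; i < hi; ++i)          // pentadiagonal stencil, eq. (2)
          next_[i] = -prev_[i] - LAMBDA2 * (curr_[i-2] - 4.0*curr_[i-1]
                     + B*curr_[i] - 4.0*curr_[i+1] + curr_[i+2]);
        // The halves exchange boundary values through curr_, so each step
        // ends with a wait: this is the synchronisation overhead measured.
        step_barrier.arrive_and_wait();
      }
    }

    int main() {
      curr_[NSEG / 2 + 2] = 1e-3;              // small initial disturbance
      std::thread t1(worker, 2, NSEG / 2 + 2); // first half of the beam
      std::thread t2(worker, NSEG / 2 + 2, NSEG + 2);
      t1.join(); t2.join();
      std::printf("mid-beam deflection: %g\n", curr_[NSEG / 2 + 2]);
    }

Removing the arrive_and_wait() call corresponds to the "without synchronisation" runs of Figure 3: the loops then scale almost ideally on two processors, but the computed deflections are no longer guaranteed to be consistent.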

Figure 3: Execution time (sec) for a single thread and two threads without synchronization

TABLE 1 - Multithread execution time (sec) in implementing the simulation algorithm with and without synchronisation, for 3,000, 6,000 and 9,000 iterations (WS = with synchronisation, WOS = without synchronisation, T = number of threads; entries illegible in the source are shown as …)

            3,000 iter.       6,000 iter.       9,000 iter.
    T       WS      WOS       WS      WOS       WS      WOS
    1       …       …         …       …         .5      .5
    2       .33     …         .66     …         …       …
    4       .66     …         …       .5        .9      .7
    6       .98     .3        .6      .6        .86     .7
    8       .6      .3        .75     .6        3.86    .9

Figure 4: Execution time (sec) for two threads with and without synchronization

Table 2 shows the communication overhead and task granularity for 9,000 iterations. It is noticed that the communication overhead is significantly higher than the actual computation time, which in turn reduces the task granularity ratio. This further demonstrates the impact of the overhead caused by the significant level of thread dependencies, reflecting the inherent data dependencies of the simulation algorithm.

TABLE 2 - Communication overhead and task granularity for multithreading for 9,000 iterations (T = number of threads, E = execution time including communication overhead, R = actual computation time, C = communication overhead, R/C = task granularity; entries illegible in the source are shown as …)

    T       E       R       C       R/C
    1       .5      .5      -----   -----
    2       …       …       .8      .7
    4       .9      .7      .85     .38
    6       .86     .7      .79     .5
    8       3.86    .9      3.77    …

To explore the capabilities of concurrent multithreading further, an I/O task was inserted within the flexible beam simulation algorithm, involving the saving of results into a file in secondary memory. The algorithm was implemented in accordance with the design mechanism shown in Figure 5, forcing the two concurrently executing threads to wait for each other. The first and most important observation that can be made from the graphical representation in Figure 6 is that the performance of the program actually increased beyond a certain number of segments. This could be due to balanced loading of the threads, which improves the overall CPU utilisation of the program and reduces the OS overhead generated by threads being suspended when trying to acquire a lock on a locked mutex variable. Once the calculation thread became sufficiently busy, the probability of a thread hitting a lock that was already held dropped drastically, resulting in higher performance. It appears odd but intriguing that performance could be improved by adding more tasks to one thread.

Figure 5: Algorithm design mechanism (MUTEX represents mutual exclusion): after initialisation, one thread locks MUTEX A, stores the beam positions and unlocks MUTEX B, while the other locks MUTEX B, performs the calculations and unlocks MUTEX A

Figure 6: Performance of the algorithm with file I/O (execution time in seconds against number of beam segments)
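One way to realise the Figure 5 handshake is sketched below (illustrative, not the authors' code). The calculation thread and the file-saving thread alternate strictly, each releasing the other after finishing its half of the cycle. C++20 binary semaphores stand in for the paper's MUTEX A and MUTEX B, since a standard C++ mutex may only be unlocked by the thread that locked it; the buffer size, step count and file name are assumptions:

    // build: g++ -std=c++20 -pthread handshake.cc
    #include <fstream>
    #include <semaphore>
    #include <thread>
    #include <vector>

    std::binary_semaphore calc_turn(1);   // "MUTEX B": calculation goes first
    std::binary_semaphore store_turn(0);  // "MUTEX A": storing waits for data

    std::vector<double> beam(20, 0.0);    // current beam deflections
    constexpr int STEPS = 1000;           // illustrative iteration count

    void calculator() {
      for (int k = 0; k < STEPS; ++k) {
        calc_turn.acquire();              // lock "MUTEX B"
        // ... perform one simulation time step on beam ...
        store_turn.release();             // unlock "MUTEX A"
      }
    }

    void writer(std::ofstream& out) {
      for (int k = 0; k < STEPS; ++k) {
        store_turn.acquire();             // lock "MUTEX A"
        for (double y : beam) out << y << ' ';   // store beam positions
        out << '\n';
        calc_turn.release();              // unlock "MUTEX B"
      }
    }

    int main() {
      std::ofstream out("beam.dat");
      std::thread t1(calculator), t2(writer, std::ref(out));
      t1.join(); t2.join();
    }

With this design each thread spends part of every cycle blocked on its semaphore; as the observation above suggests, giving the calculation thread more work per cycle reduces the fraction of time lost to such blocking.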
Finally, the control algorithm was implemented to explore its impact in real-time implementation. As noted earlier, the control algorithm is tightly coupled with the simulation algorithm, and the computational load of calculating the control response in each iteration is equivalent to the computation of one segment of the beam. It is observed from the experiments that the execution time of the control algorithm is at a similar level to that of the simulation algorithm, and its complexity in terms of multithreading remains similar. A detailed description of the control algorithm implementation is therefore not provided here, to avoid duplication.

CONCLUDING REMARKS

An investigation into the dependency and load balancing of multithreading in multiprocessor domains for real-time control, emphasising thread synchronisation, has been presented, and the real-time implementation of the algorithm has been demonstrated. It has been shown that data dependency and load balancing are critical constraints on multithreading for real-time implementation. It is also evident that, to achieve enhanced performance, the number of threads must be matched to the number of processors in the domain. It has been noticed that fine-grain threading with a high level of data dependency can cause a significant level of performance degradation. In this particular application, increasing the number of threads increases the synchronisation overhead, which in turn relates to inter-process dependencies, the operating system's overhead for run-time scheduling, and memory management. Finally, it is noted that synchronisation, data dependencies, load balancing and algorithm granularity are critical issues for real-time multithreading in the multiprocessor domain.

REFERENCES

1. Gray, J. S., 1998, "Interprocess communications in UNIX", Prentice Hall PTR, USA, ISBN 0-13-899592-3.
2. Lo, J. L., Barroso, L. A., Eggers, S. J., Gharachorloo, K., Levy, H. M. and Parekh, S. S., 1998, "An analysis of database workload performance on simultaneous multithreaded processors", Proceedings of the 25th Annual International Symposium on Computer Architecture, June 1998, USA.
3. Moldovan, D. I., 1993, "Parallel processing from applications to systems", Morgan Kaufmann Publishers, USA.
4. Hossain, M. A., 1995, "Digital signal processing and parallel processing for real-time adaptive noise and vibration control", Ph.D. thesis, Department of Automatic Control and Systems Engineering, The University of Sheffield, UK.
5. Tullsen, D. M., Lo, J. L., Eggers, S. J. and Levy, H. M., 1999, "Supporting fine-grained synchronization on a simultaneous multithreading processor", Proceedings of the 5th International Symposium on High Performance Computer Architecture, January 1999, USA.
6. Nocetti, G. D. F. and Fleming, P. J., 1991, "Performance studies of parallel real-time controllers", Proceedings of the IFAC Workshop on Algorithms and Architectures for Real-time Control, Bangor, UK.
7. Tokhi, M. O. and Hossain, M. A., 1995, "Homogeneous and heterogeneous parallel architectures in real-time signal processing and control", Control Engineering Practice, 3, 675-686.
8. Liau, W. and Prasanna, V. K., 1998, "Utilising the power of high performance computing", IEEE Signal Processing Magazine, 15(5), 85-100.
9. Virk, G. S. and Kourmoulis, P. K., 1988, "On the simulation of systems governed by partial differential equations", Proceedings of the IEE Control-88 Conference.
10. Tokhi, M. O. and Hossain, M. A., 1994, "Self-tuning active vibration control in flexible beam structures", Proceedings of the IMechE, Part I: Journal of Systems and Control Engineering, 208(4), 263-277.
11. Hossain, M. A., Kabir, U. and Tokhi, M. O., 2002, "Impact of data dependencies for real-time high performance computing", Microprocessors and Microsystems, 26(6), 253-261.