EEC-681/781 Distributed Computing Systems Lecture 2 Outline Overview of distributed systems Design Goals (part 2) Hardware Concepts Software Concepts 2 Department of Electrical and Computer Engineering Cleveland State University wenbing@ieee.org Review of Lecture 1 3 Definition of a Distributed System 4 Definition of distributed systems Design goals (part 1) Original: A collection of independent computers that appear to the users as a single coherent system Modified: A piece of software that ensures a collection of autonomous computers to appear as a single coherent system Autonomous computers connected by a network Software specifically designed to provide an integrated computing facility 1
Design Goals 5 Distribution Transparency 6 Connecting users and resources Transparency: Users feel like they are using a single-user system Openness: Services provided are based on standards Flexibility: Separation of policy and mechanisms Access transparency Location transparency Migration transparency Relocation transparency Replication transparency Concurrency transparency Scalability, availability, security Persistency transparency Distribution Transparency Mini-quiz 7 Distribution Transparency Mini-quiz 8 The transparency that hides where a resource is located is called: a) Access transparency b) Location transparency c) Relocation transparency d) Migration transparency The transparency that hides differences in data representation and how a resource is accessed a) Access transparency b) Location transparency c) Relocation transparency d) Migration transparency The transparency that hides the fact that a resource may move to another location is called a) Access transparency b) Location transparency c) Relocation transparency d) Migration transparency The transparency that hides the fact that a resource may be moved to another location while in use is called a) Access transparency b) Location transparency c) Relocation transparency d) Migration transparency 2
9 10 Replication Transparency Concurrency Transparency Replication transparency - Hide that a resource is replicated More than one copy is available All replica should have the same visible name Concurrency transparency - Hide that a resource may be shared by several competitive users This feature is really nothing new. Operating systems have been offering concurrency transparency for a number of decades Easy to guarantee if accesses to the same resource are all read-only Care must be taken to maintain consistence if some accesses are updates 11 12 Failure Transparency Persistency Transparency Failure Transparency - Hide the failure and recovery of a resource Can be achieved through replication But, very challenging and costly in general Persistency Transparency - Hide whether a (software) resource is in memory or on disk 3
Degree of Transparency 13 Degree of Transparency 14 Observation: Aiming at full distribution transparency may be too much Sometime distribution is apparent and not something you want to hide, e.g., users may be located in different continents Completely hiding failures of networks and nodes is (theoretically and practically) impossible You cannot distinguish a slow computer from a failing one You can never be sure that a server actually performed an operation before a crash Full transparency will cost performance, exposing distribution of the system Keeping Web caches exactly up-to-date with the master copy Immediately flushing write operations to disk for fault tolerance Openness of Distributed Systems Open distributed system: Be able to interact with services from other open systems, irrespective of the underlying environment Systems should conform to well-defined interfaces Systems should support portability of applications Systems should easily interoperate 15 Openness of Distributed Systems Achieving openness: At least make the distributed system independent from heterogeneity of the underlying environment Hardware Platforms Languages 16 4
Implementation Openness 17 Implementation Openness 18 Openness requires flexibility Implementing openness: Requires support for different policies specified by applications and users What level of consistency do we require for client cached data? Which operations do we allow downloaded code to perform? Which QoS requirements do we adjust in the face of varying bandwidth? What level of secrecy do we require for communication? Implementing openness: Ideally, a distributed system provides only mechanisms: Allow (dynamic) setting of caching policies, preferably per cacheable item Support different levels of trust for mobile code Provide adjustable QoS parameters per data stream Offer different encryption algorithms 19 20 Mechanisms and Policies Example: Managing a Queue Mechanisms determine how to do something while policies decide what should be done The separation of policy from mechanism allows maximum flexibility in choosing policies and if policy decisions are to be changed later Let s use an abstract priority queue as example We need to support mechanisms for: Insert/Delete items at start Insert/Delete items at end Know length of queue The queue can be implemented in different ways Policies can be for example FIFO, LIFO should be decided by queue user 5
Scale in Distributed Systems 21 Size Scalability 22 Scalability can be measured at three dimensions: Size scalability We can easily add more users and resources to the system Geographical scalability users and resources may lie far apart geographically Administrative scalability The system can still be easy to manage even if it spans many independent administrative organizations Thomas J. Watson, Chairman of IBM, 1943: I think there is a world market for maybe five computers Internet: July 1993: 1,776,000 computers July 1999: 56,218,000 computers Scalability problems in distributed systems appear as performance problems caused by limited capacity of servers and network January 2002: 168,000,000 computers and > 23,000,000 DNS domains 23 24 Size Scalability Problems Size Scalability Problems Concept Centralized services Centralized data Centralized algorithms Example A single server for all users A single on-line telephone book Doing routing based on complete information Problem running centralized algorithms in distributed systems Would result in enormous number of messages have to be routed over many lines Any algorithm that operates by collecting information from all sites, sends it to a single machine for processing, and then distributes the results must be avoided 6
Decentralized Algorithm Characteristics 25 Geographical Scalability Problems 26 No machine has complete information about the system state Machines make decisions based only on local information Failure of one machine does not ruin the algorithm There is no implicit assumption that a global clock exists Interprocess communication in WANs has much longer latency than that in LANs Communication in WANs is inherently unreliable, and virtually always point-to-point Centralized components would reduce geographical scalability, just as does to size scalability Administration Scalability Problems 27 Techniques for Scaling 28 Different administrative domain usually impose different policies, e.g., with respect to resource usage, management, and security Hiding communication latencies Distribution Replication 7
Hiding Communication Latencies 29 Hiding Communication Latencies Move Computation to Clients 30 Hiding communication latencies is applicable to in the case of geographical scalability Technique #1: Try to avid waiting for responses to remote service requests as much as possible Technique #2: Reduce the overall communication by moving part of the computation that is normally done at the server to the client process requesting the service Technique for Scaling - Distribution 31 Decentralized Naming Service 32 Distribution: Partition data and computations across multiple machines Examples: Domain name services (DNS) DNS name space is hierarchically organized into a tree of domains, which are divided into nonoverlapping zones 8
Techniques for Scaling Replication 33 Techniques for Scaling Replication 34 Replication: Make copies of data available at different machines across the distributed system Examples: Replicated file servers (mainly for fault tolerance) Replicated databases Mirrored Web sites Large-scale distributed shared memory systems Replication not only increases availability, but also helps to balance the load between components leading to better performance Replication also help increase the geographical scalability by placing a copy nearby different users Problem with Scaling by Replication 35 Problem with Scaling by Replication 36 Applying scaling techniques through replication sounds straightforward, but be aware that having multiple copies might leads to inconsistencies: modifying one copy makes that copy different from the rest Always keeping copies consistent and in a general way requires global synchronization on each modification Global synchronization precludes large-scale solutions If we can tolerate inconsistencies, we may reduce the need for global synchronization Tolerating inconsistencies is application dependent 9
Techniques for Scaling Caching Caching: A special form of replication. It allows client processes to access local copies Web caches (browser/web proxy) File caching (at server and client) Similarity to replication: making a copy of a resource, generally in the proximity of the client accessing that resource Difference from replication: caching is a decision made by the client of a resource, not by the owner of the resource 37 Distributed Systems: Hardware Concepts Multiprocessors Multicomputers Networks of Computers 38 Multiprocessors and Multicomputers 39 Networks of Computers 40 Distinguishing features: Private versus shared memory Bus versus switched interconnection High degree of node heterogeneity: High-performance parallel systems (multiprocessors as well as multicomputers) High-end PCs and workstations (servers) Simple network computers (offer users only network access) Mobile computers (palmtops, laptops) Multimedia workstations High degree of network heterogeneity: Local-area gigabit networks Wireless connections Wide-area switched megabit connections 10
Distributed Systems: Software Concepts 41 Distributed Operating Systems 42 An overview between DOS (Distributed Operating Systems) NOS (Network Operating Systems) Middleware Some characteristics OS on each computer knows about the other computers OS on different computers generally the same Services are generally (transparently) distributed across computers System Description Main Goal DOS NOS Middleware Tightly-coupled operating system for multiprocessors and homogeneous multicomputers Loosely-coupled operating system for heterogeneous multicomputers (LAN and WAN) Additional layer atop of NOS implementing general-purpose services Hide and manage hardware resources Offer local services to remote clients Provide distribution transparency Multicomputer Operating Systems Harder than traditional (multiprocessor) OS, because memory is not shared, emphasis shifts to processor communication by message passing: Often no simple global communication No simple system-wide synchronization mechanisms Virtual (distributed) shared memory requires OS to maintain global memory map in software Inherent distributed resource management: no central point where allocation decisions can be made 43 Network Operating System Some characteristics: Each computer has its own operating system with networking facilities Computers work independently (i.e., they may even have different operating systems) Services are tied to individual nodes (ftp, ssh) Highly file oriented (basically, processors share only files) 44 11
45 46 Network Operating System Network Operating System Two clients and a server in a network operating system Different clients may mount the servers in different places. 47 Distributed System (Middleware-based) Characteristics: OS on each computer need not know about the other computers OS on different computers need not generally be the same Services are generally (transparently) distributed across computers 12