SCALABILITY AND AVAILABILITY

Real systems must be:
Scalable fast enough to handle the expected load, and able to grow easily when the load grows
Available available enough of the time

Scalable

Scale-up increase the hardware so it is bigger and faster
o Increases the capacity of the system
o No need for load balancing just use a bigger box
o Runs into limits eventually
o Could provide less availability: what happens if there is a failure and there is no redundancy?
o Could be easier to manage
o Does not guarantee increased performance
o The easiest solution, if it works and you have lots of money

Scale-out have systems work together to handle the load (server farms, clusters)
o Multiple systems working together; programmers must be able to handle this
o Add more boxes at every level for the critical parts of the system: web servers for handling the user interface, application servers for running the business logic, database servers (tricky to implement)
o Spread the load across the boxes: load balancing at every level, partitioning or replication for the database
o Impact on application design and on system management
o Another constraint may sit directly behind the current one, and will need fixing as well
o The database may be able to add more machines, but this is complex
o Major impacts on application design and administration; management complexity increases considerably
o Implications for application design, especially in the management of state

Availability

Includes maintenance the key is redundancy
The goal is 100% availability, 24x7 operation, including time for maintenance
Redundancy is the key to availability: no single point of failure, a spare for everything on hand
How much downtime does a given availability allow?
o 99% - 87.6 hours per year
o 99.9% - 8.76 hours per year
o 99.99% - 0.876 hours per year
Need to consider operations as well
Robert Whitaker 1
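The downtime figures in the availability list above follow directly from the percentage; a minimal sketch of the arithmetic, assuming an 8,760-hour year:

```python
# Downtime per year implied by an availability percentage,
# assuming a 365-day (8,760-hour) year.
HOURS_PER_YEAR = 365 * 24

def downtime_hours_per_year(availability_percent):
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR

for pct in (99.0, 99.9, 99.99):
    print(pct, round(downtime_hours_per_year(pct), 3))
```

This reproduces the three figures in the list: 87.6, 8.76 and 0.876 hours per year.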
Operations to consider:
o Maintenance, software upgrades, backups, application changes
o Not just faults and recovery time

Scalability

Growth in performance
Response time instant is good
o Need to specify an acceptable response time, and it needs to be consistent
o Response times usually vary between transaction types; different classes of transaction have different acceptable times
o If a response time is constant, users accept it and don't notice it; if it fluctuates widely, users will be unhappy

Performance

How fast is the system?
o Not the same as scalability, but related
o Measured by response time and throughput
How scalable is the system?
o Concerned with the upper limits of the system
o How big can it grow, and how does it grow?

Response Time

What delay does the user see?
Response times vary with the complexity of a transaction
o Read-only transactions are fast
o Update transactions are slower
o Anything that requires opening a connection to the database is slow

Throughput

How many transactions can be handled in some period of time
o Transactions per second
o A measure of overall capacity
o The inverse of response time
There are standard benchmarks for measuring this

Capacity of the system

Will increase until some resource limit is hit
o Adding more clients just increases the response time
o Run out of processor, disk bandwidth, network bandwidth
o Some resources overload badly contention for shared resources; Ethernet network performance degrades under load
o Log file: keep only the one file open on the disk, so the heads rarely move, giving maximum disk performance want as few head movements as possible

System Capacity

How many clients can you support?
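The capacity behaviour described above can be sketched with a toy model (all numbers are illustrative assumptions, not measurements): response time stays flat until the constraining resource saturates, after which adding clients only adds delay.

```python
# Toy capacity model: below the saturation point response time is flat;
# past it, extra clients only queue up, so response time grows while
# throughput stays capped by the constraining resource.
SERVICE_TIME_MS = 50   # illustrative per-request service time
MAX_CLIENTS = 100      # illustrative saturation point

def response_time_ms(clients):
    if clients <= MAX_CLIENTS:
        return SERVICE_TIME_MS
    return SERVICE_TIME_MS * clients / MAX_CLIENTS

print(response_time_ms(50))    # below saturation: 50 ms
print(response_time_ms(200))   # past saturation: 100 ms
```

Plotting this function against the number of clients gives the characteristic "flat, then climbing" curve you would look for in a benchmark.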
o Need to specify an acceptable response time
o Plot response time versus the number of clients
Great if you can run benchmarks a reason for prototyping and proving proposed architectures before leaping into full-scale implementation
Every system has a constraining resource and can be extended until you reach that resource

Load Balancing

Balancing client bindings across servers or processes
o Needed for stateful systems
o Static allocation of client to server
Balancing requests across server systems or processes
o Dynamically allocating requests to servers
o Normally only done for stateless systems
CORBA implementation
o The client calls on the name server to find the location of a suitable server name server is the CORBA terminology for an object directory
o The name server can spread client objects across multiple servers often round robin
o The client is bound to a server and stays bound forever this can lead to performance problems if server loads become unbalanced
Dynamic load balancing balance the load across servers dynamically, so requests from a client can go to any server
o Requests are dynamically routed often used for web server farms
o Routing decisions have to be fast the router is in the main processing path
o Applications are normally stateless
Static: bind a client to a server or a process on a server; that binding is needed for stateful systems
Dynamic: balance requests across a number of servers to spread the load uniformly. If you push the button twice you are likely to go to different servers, but each web server may still have a static binding to an application server
CORBA: the name server's job is to distribute client requests across the different instances. It does something similar to round robin. Once a client is bound to a server, it is bound forever.
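The binding scheme just described can be sketched as a toy (illustrative Python, not the CORBA API): the name server hands out server instances round robin, and a client keeps the first binding it gets.

```python
from itertools import cycle

# Toy name server: hands out server instances round robin, but once a
# client is bound it keeps that binding forever, which is how load can
# end up unbalanced if some clients are much busier than others.
class NameServer:
    def __init__(self, servers):
        self._round_robin = cycle(servers)
        self._bindings = {}

    def lookup(self, client):
        if client not in self._bindings:
            self._bindings[client] = next(self._round_robin)
        return self._bindings[client]

ns = NameServer(["app1:9000", "app2:9000"])
print(ns.lookup("alice"))   # app1:9000
print(ns.lookup("bob"))     # app2:9000
print(ns.lookup("alice"))   # still app1:9000 -- bound forever
```

The server names and port numbers here are made up for illustration.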
The problem with binding a client to a server forever is that servers may be under-utilised: many clients can end up bound to one busy server while other, less busy servers sit idle.

Name Server

Server processes call the name server when they come up, advertising what services they offer
Clients call the name server to find the location of a server process it is up to the name server to match clients to servers
Clients then call the server process to create objects
Dynamic load balancing can be done even with stateful servers
o Clients can throw away server objects and get new ones every now and then; this is implemented in the application code or the middleware
o Or the middleware can perform object replication: keep copies of the same object on all servers, replicate changes to all servers, and give clients references to all copies of the object
Dynamic stateful: save the state somewhere and restore it if needed, or replicate the stateful object over different servers
WebLogic example: two servers, A and B, with full replication of the stateful object across both. On a client request, both servers get the object created, and the client can choose which server to use. On commit, both server objects are updated. The reason: if machine B dies, machine A takes over.

Dynamic Load Balancing

Spread requests equally across all servers
o Requests can go to any server
o Web server farms are built this way
o Need to route requests
IP sprayer: one IP address that spreads incoming connections over N ports. It must be reliable, because all connections go through it, so it controls the reliability of your system and application. A network load balancer splits requests based on IP
The request routing needs to be fast and reliable, as it is on the main request path stateless

Web Server Farms

Highly scalable a type of cluster
Web applications are normally stateless
o The next request can go to any web server
o State comes from the client or the database
Just need to be able to spread the requests across the machines

Clusters

A group of independent computers acting like a single system
o Shared disks each server shares access to the disks in the system
o Single IP address
o Single set of services
o Fail over to other members of the cluster
o Each server in the cluster knows the status of the other servers
A group of independent, autonomous computers acting as a single machine; the machines inside the cluster share some resources
Machines inside the cluster take over from failed machines transparent failover
Some clusters also do load sharing within the cluster
Addresses scalability: add more boxes to the cluster, with replication or sharing of storage
Addresses availability: allows you to add or remove boxes from the cluster for maintenance and upgrades
Can be used as one element of a highly available system
Heartbeats between machines allow them to monitor one another; if A sees that B has died, it can take over
Availability is a function of how often the system fails and how quickly it becomes available again
Clusters allow you to take down a machine while the cluster continues to work especially useful for maintenance
State stores are harder to scale

Threaded Servers

Allow the load to be spread across individual processors
One process may have a lot of requests while an identical one has none; it would be better to spread the load evenly
We can have process load balancing or processor load balancing; CORBA uses process-instance load balancing to be system independent all systems use processes, so the implementation is portable
Modern approaches use thread pools: a client can be bound to one process and have its requests handled by a thread, so even busy processes can handle requests
No need for load balancing within a single system
o Multithreaded server process with a thread pool servicing requests
o All objects live in a single process space
o Any request can be picked up by any thread

Scaling Data Stores

Much harder to do because of ACID the data stores hold state
Solution: buy more hardware
Replication: make multiple copies useful for high-contention data that is not updated often but is shared a lot
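Replication of this kind can be sketched as a toy in-memory illustration (real databases push changes through their own replication protocols; the names here are invented):

```python
# Toy replication: a change is staged locally and, on commit, applied
# to every copy, so any replica can serve reads and a surviving server
# can take over after a failure.
class ReplicatedRecord:
    def __init__(self, copies):
        self.copies = copies       # one dict per server holding a copy
        self.pending = {}

    def update(self, key, value):
        self.pending[key] = value  # change is staged, not yet replicated

    def commit(self):
        for copy in self.copies:   # replicate the change to every copy
            copy.update(self.pending)
        self.pending = {}

server_a, server_b = {}, {}
record = ReplicatedRecord([server_a, server_b])
record.update("balance", 100)
record.commit()
print(server_a, server_b)   # both copies now hold balance=100
```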
The trick is to change something and then replicate the change to all other copies
Partitioning: if one database contains all the customers of a business, it might be useful to partition those customers and distribute the partitions across different physical locations. This helps reduce contention by distributing the partitions and redirecting requests to different databases. The problem is what happens when we want all the customers: we can use a partitioned view, specifying all the database partitions and asking for a view over all customers. To obtain scalability at each partition's physical location, we can use a cluster.

Availability through Redundancy

Redundancy through the addition of spare equipment
Active standby the redundant system monitors everything the primary does and ensures it can jump in immediately when needed; it must always be kept up to date
Passive standby the system sits in wait and jumps in if needed; it has to play catch-up, getting back to the state the failed system was in before it died
For an active standby system we keep a copy of the database: every request to the active system is sent to the standby as well

Fragility

Large distributed synchronous systems are not robust. With such tight coupling, if a remote system suddenly dies because of a failure, you have to wait for a response that may never come
Asynchronous is better: if the target system is down, the message will eventually get there
We rely on guarantees, usually made by the middleware. The problem is with committed transactions: what do we do when we find out later that the transaction failed?

Availability and Scalability

Often a question of application design
Stateful v stateless
o What happens if a server fails?
o Can requests go to any server?
Synchronous method calls or asynchronous messaging
o Reduce dependencies between components
o Failure-tolerant designs
Manageability decisions to consider
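The contrast between synchronous calls and asynchronous messaging above can be sketched with a simple in-process queue (a toy illustration, not a real middleware API): the sender never waits, and a receiver that was down simply catches up later.

```python
import queue

# Asynchronous messaging decouples components: the sender enqueues and
# moves on; a receiver that was down catches up when it comes back.
outbox = queue.Queue()

def send(message):
    outbox.put(message)            # returns immediately, even if receiver is down

def drain():
    handled = []
    while not outbox.empty():      # receiver restarts and works the backlog
        handled.append(outbox.get())
    return handled

send("order-1")
send("order-2")                    # receiver is down; messages just wait
print(drain())                     # ['order-1', 'order-2']
```

A real message-queuing middleware adds the durability guarantees discussed above, so messages survive even if the sending machine itself fails.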