8 Interconnection Networks and Clusters


"The Medium is the Message" because it is the medium that shapes and controls the search and form of human associations and actions.
Marshall McLuhan, Understanding Media (1964)

The marvels of film, radio, and television are marvels of one-way communication, which is not communication at all.
Milton Mayer, On the Remote Possibility of Communication (1967)

8.1 Introduction
8.2 A Simple Network
8.3 Interconnection Network Media
8.4 Connecting More Than Two Computers
8.5 Network Topology
8.6 Practical Issues for Commercial Interconnection Networks
8.7 Examples of Interconnection Networks
8.8 Internetworking
8.9 Crosscutting Issues for Interconnection Networks
8.10 Clusters
8.11 Designing a Cluster
8.12 Putting It All Together: The Google Cluster of PCs
8.13 Another View: Inside a Cell Phone
8.14 Fallacies and Pitfalls
8.15 Concluding Remarks
8.16 Historical Perspective and References
Exercises

8.1 Introduction

Thus far we have covered the components of a single computer, which has been the traditional focus of computer architecture. In this chapter we see how to connect computers together, forming a community of computers. Figure 8.1 shows the generic components of this community: computer nodes, hardware and software interfaces, links to the interconnection network, and the interconnection network. Interconnection networks are also called networks or communication subnets, and nodes are sometimes called end systems or hosts. The connection of two or more interconnection networks is called internetworking, which relies on communication standards to convert information from one kind of network to another. The Internet is the most famous example of internetworking.

FIGURE 8.1 Drawing of the generic interconnection network. [Figure: four nodes, each attached through a software interface, a hardware interface, and a link to the interconnection network.]

There are two reasons that computer architects should devote attention to networking. In addition to providing external connectivity, Moore's Law shrunk networks so much that they connect the components within a single computer. Using a network to connect autonomous systems within a computer has long been found in mainframes, but today such designs can be found in PCs too. Switches are replacing buses as the normal communication technique: between computers, between I/O devices, between boards, between chips, and even between modules inside chips. As a result, computer architects must understand networking terminology, problems, and solutions in order to design and evaluate modern computers.

The second reason architects should study networking is that today almost all computers are, or will be, networked to other devices. Thus, understanding networking is critical; any device without a network is somehow flawed. Just as a modern computer without a memory hierarchy is "broken" (hence a chapter just for it), a modern computer without a network is "broken" too. Hence this chapter.

This topic is vast, with portions of Figure 8.1 the subject of whole books and college courses. Networking is also a buzzword-rich environment, where many simple ideas are obscured behind acronyms and unusual definitions. To help you break through the buzzword barrier, Figure 8.2 is a glossary of about 80 networking terms. The goal of this chapter is to provide computer architects a gentle, qualitative introduction to networking. It defines terms, helps you understand the architectural implications of interconnection network technology, provides introductory explanations of the key ideas, and gives references to more detailed descriptions.

Most of this chapter is on networking, but the final quarter of this chapter focuses on clusters. A cluster is the coordinated use of interconnected computers in a machine room. In contrast to the qualitative network introduction, these sections give a more quantitative description of clusters, including many examples. It ends with a guided tour of the Google clusters.

FIGURE 8.2 Networking terms in this chapter and their definitions

Term: Definition
adaptive routing: Router picks best path based upon measure of delay on outgoing links
ATM: Asynchronous Transfer Mode is a WAN designed for real-time traffic such as digital voice
attenuation: Loss of signal strength as signal passes through the medium over a long distance
backpressure flow control: When the receiver cannot accept another message, separate wires between adjacent senders and receivers tell the sender to stop immediately. It causes links between two end points to freeze until the receiver makes room for the next message.
bandwidth: Maximum rate the network can propagate information once the message enters the network
base station: A network architecture that uses boxes connected via land lines to communicate to wireless handsets
bisection bandwidth: Sum of the bandwidth of lines that cross that imaginary dividing line between two roughly equal parts of the network, each with half the nodes
bit error rate: BER, the error rate of a network, typically in errors per million bits transferred
blade: A removable computer component that fits vertically into a box in a standard VME rack
blocking: Contention that prevents a message from making progress along a link of a switch
bridge: OSI layer 2 networking device that connects multiple LANs, which can operate in parallel; in contrast, a router connects networks with incompatible addresses at OSI layer 3
category 5 wire: Cat 5 twisted-pair, copper wire used for 10, 100, and 1000 Mbits/sec LANs
carrier sensing: Listening to the medium to be sure it is unused before trying to send a message
channel: In wireless networks, it is a pair of frequency bands that allow 2-way communication
checksum: A field of a message for an error correction code
circuit switching: A circuit is established from source to destination, reserving bandwidth along a path until the circuit is broken
cluster: Coordinated use of interconnected computers in a machine room
coaxial cable: A single stiff copper wire is surrounded by insulating material and a shield; historically faster and longer distance than twisted pair copper wire
collision: Two nodes (or more) on a shared medium try to send at the same time
collision detection: Listening to shared medium after sending to see if a message collided with another

FIGURE 8.2 (continued) Networking terms in this chapter and their definitions

Term: Definition
collocation site: A warehouse for remote hosting of servers with expansible networking, space, cooling, and security
communication subnets: Another name for interconnection network
credit-based flow control: To reduce overhead for flow control, a sender is given a credit to send up to N packets, and only checks for network delays when the credit is spent
cut-through routing: The switch examines the header, decides where to send the message, and then starts transmitting it immediately without waiting for the rest of the message. When the head of the message blocks, the message stays strung out over the network.
destination-based routing: The message contains a destination address, and the switch picks a path to deliver the message, often by table lookup
deterministic routing: Router always picks the same path for the message
end systems: Another name for interconnection network node as opposed to the intermediate switches
end-to-end argument: Intermediate functions (error checking, performance optimization, and so on) may be incomplete as compared to performing the function end-to-end
Ethernet: The most popular LAN, it has scaled from its original 3 Mbits/second rate using shared media in 1975 to switched media at 1000 Mbits/second in 2001; it shows no signs of stopping
fat tree: A network topology with extra links at each level enhancing a simple tree, so bandwidth between each level is normally constant (see Figure 8.14)
FC-AL: Fibre Channel Arbitrated Loop; a SAN for storage devices
frequency-division multiplexing: Divide the bandwidth of the transmission line into a fixed number of frequencies, and assign each frequency to a conversation
full duplex: Two-way communication on a network segment
header: The first part of a message that contains no user information, but whose contents help the network, such as providing the destination address
host: Another name for interconnection network node
hub: An OSI layer 1 networking device that connects multiple LANs to act as one
Infiniband: An emerging standard SAN for both storage and systems in a machine room

FIGURE 8.2 (continued) Networking terms in this chapter and their definitions

Term: Definition
interference: In wireless networks, reduction of signal due to frequency reuse; frequency is reused to try to increase the number of simultaneous conversations over a large area
internetworking: Connection of two or more interconnection networks
IP: Internet Protocol is an OSI layer 3 protocol, at the network layer
iSCSI: SCSI over IP networks, it is a competitor to SANs using IP and Ethernet switches
LAN: Local Area Network, for machines in a building or campus, such as Ethernet
message: The smallest piece of electronic "mail" sent over a network
multimode fiber: An inexpensive optical fiber that reduces bandwidth and distance for cost
multipath fading: In wireless networks, interference between multiple versions of signal that arrive at different times, determined by time between fastest signal and slowest signal relative to signal bandwidth
multistage switch: A switch containing many smaller switches that perform a portion of routing
OSI layer: Open System Interconnect models the network as seven layers (see Figure 8.25)
overhead: In this chapter, networking overhead is sender overhead + receiver overhead + time of flight
packet switching: In contrast to circuit switching, information is broken into packets (usually fixed or maximum size), each with its own destination address, and they are routed independently
payload: The middle part of the message that contains user information
peer-to-peer protocol: Communication between two nodes occurs logically at the same level of the protocol
peer-to-peer wireless: Instead of communicating to base stations, peer-to-peer wireless networks communicate between handsets
protocol: The sequence of steps that network software follows to communicate
rack unit: An R.U. is 1.7 inches, the height of a single slot in a standard 19-inch VME rack; there are 44 R.U. in a standard 6-foot rack
receiver overhead: The time for the processor to pull the message from the interconnection network
router: OSI layer 3 networking device that connects multiple LANs with incompatible addresses
SAN: Originally System Area Network but more recently Storage Area Network, it connects computers and/or storage devices in a machine room. FC-AL and Infiniband are SANs.

FIGURE 8.2 (continued) Networking terms in this chapter and their definitions

Term: Definition
sender overhead: The time for the processor to inject the message into the network; the processor is busy for the entire time
shadow fading: In wireless networks, when the received signal is blocked by objects; buildings outdoors or walls indoors
signal-to-noise ratio: SNR, the ratio of the strength of the signal carrying information to the background noise
simplex: One-way communication on a network segment
single-mode fiber: Single-wavelength fiber is narrower and more expensive than multimode fiber but it offers greater bandwidth and distance
source-based routing: The message specifies the path to the destination at each switch
store-and-forward: Each switch waits for the full message to arrive before it is sent on to the next switch
TCP: Transmission Control Protocol, it is an OSI layer 4 protocol (transport layer)
throughput: In networking, measured speed of the medium or network bandwidth delivered to an application; i.e., does not give credit for headers and trailers
time of flight: The time for the first bit of the message to arrive at the receiver
trailer: The last part of a message that has no user information but helps the network, such as error correction code
transmission time: The time for the message to pass through the network (not including time of flight)
transport latency: Time that the message spends in the interconnection network (including time of flight)
twisted pairs: Two wires twisted together to reduce electrical interference
virtual circuit: A logical circuit is established between source and destination for a message to follow
WAN: Wide Area Network, a network across a continent, such as ATM
wavelength division multiplexing: WDM sends different streams simultaneously on the same fiber using different wavelengths of light and then demultiplexes the different wavelengths at the receiver
window: In TCP, the number of TCP datagrams that can be sent without waiting for approval
wireless network: A network that communicates without physical connections, such as radio
wormhole routing: The switch examines the header, decides where to send the message, and then starts transmitting it immediately without waiting for the rest of the message. The tail continues when the head blocks, potentially compressing the strung-out message into a single switch

Let's start with the generic types of interconnections. Depending on the number of nodes and their proximity, these interconnections are given different names:

Wide area network (WAN): Also called long haul network, the WAN connects computers distributed throughout the world. WANs include thousands of computers, and the maximum distance is thousands of kilometers. ATM is a current example of a WAN.

Local area network (LAN): This network connects hundreds of computers, and the distance is up to a few kilometers. Unlike a WAN, a LAN connects computers distributed throughout a building or on a campus. The most popular and enduring LAN is Ethernet.

Storage or System area network (SAN): This interconnection network is for a machine room, so the maximum distance of a link is typically less than 100 meters, and it can connect hundreds of nodes. Today SAN usually means Storage area network, as it connects computers to storage devices, such as disk arrays. Originally SAN meant a System area network to connect computers together, such as PCs in a cluster. A recent SAN trying to network both storage and system is Infiniband.

Figure 8.3 shows the rough relationship of these systems in terms of the number of autonomous systems connected, including a bus for comparison. Note the area of overlap between buses, SANs, and LANs, which leads to product competition.

FIGURE 8.3 Relationship of four types of interconnects in terms of number of autonomous systems connected: bus, system or storage area network, local area network, and wide area network/Internet. Note that there are overlapping ranges where buses, SANs, and LANs compete. Some supercomputers have a switch-based custom network to interconnect up to thousands of computers; such interconnects are basically custom SANs. [Figure: overlapping ranges for Bus, SAN, LAN, and WAN/Internet along an axis labeled Number of Autonomous Systems Connected.]

These three types of interconnection networks have been designed and sustained by several different cultures (Internet, telecommunications, workgroup/enterprise, storage, and high performance computing), each using its own dialects and its own favorite approaches to the goal of interconnecting autonomous computers. This chapter gives a common framework for evaluating all interconnection networks, using a single set of terms to describe the basic alternatives. Figure 8.22 in section 8.7 gives several other examples of each of these interconnection networks. As we shall see, some components are common to all types and some are quite different.

We begin the chapter in section 8.2 by exploring the design and performance of a simple network to introduce the ideas. We then consider the following problems: which media to use as the interconnect (8.3), how to connect many computers together (8.4 and 8.5), and what are the practical issues for commercial networks (8.6). We follow with examples illustrating the trade-offs for each type of network (8.7), explore internetworking (8.8), and crosscutting issues for networks (8.9). With this gentle introduction to networks in sections 8.2 to 8.9, readers interested in more depth should try the suggested reading in the Historical Perspective and References section. Sections 8.10 to 8.12 switch to clusters, and give a more quantitative description with designs and examples. Section 8.13 gives a view of networks from the embedded perspective, using a cell phone and wireless networks as the example. We conclude in sections 8.14 to 8.16 with the traditional ending of the chapters.

As we shall see, networking shares more characteristics with storage than with processors and memory. Like storage, the operating system controls what features of the network are used. Again like storage, performance includes both latency and bandwidth, and queueing theory is a valuable tool. Like RAID, networking assumes failures occur, and thus dependability in the presence of errors is the norm.

8.2 A Simple Network

There is an old network saying: Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed; you can't bribe God.
Anonymous

To explain the complexities and concepts of networks, this section describes a simple network of two computers. We then describe the software steps for these two machines to communicate. The remainder of the section gives a detailed and then a simple performance model, including several examples to see the implications of key network parameters.

Suppose we want to connect two computers together. Figure 8.4 shows a simple model with a unidirectional wire from machine A to machine B and vice versa. At the end of each wire is a first-in-first-out (FIFO) queue to hold the data. In this simple example, each machine wants to read a word from the other's memory. A message is the information sent between machines over an interconnection network.

FIGURE 8.4 A simple network connecting two machines. [Figure: Machine A and Machine B joined by one wire in each direction.]

For one machine to get data from the other, it must first send a request containing the address of the data it desires from the other node. When a request arrives, the machine must send a reply with the data. Hence, each message must have at least 1 bit in addition to the data to determine whether the message is a new request or a reply to an earlier request. The network must distinguish between information needed to deliver the message, typically called the header or the trailer depending on where it is relative to the data, and the payload, which contains the data. Figure 8.5 shows the format of messages in our simple network. This example shows a single-word payload, but messages in some interconnection networks can include hundreds of words.

Interconnection networks normally involve software. Even this simple example invokes software to translate requests and replies into messages with the appropriate headers. An application program must usually cooperate with the operating system to send a message to another machine, since the network will be shared with all the processes running on the two machines, and the operating system cannot allow messages for one process to be received by another. Thus, the messaging software must have some way to distinguish between processes; this distinction may be included in an expanded header. Although hardware support can reduce the amount of work, most is done by software.

In addition to protection, network software is often responsible for ensuring reliable delivery of messages. The twin responsibilities are ensuring that the message is neither garbled nor lost in transit.

Header (1 bit)    Payload (32 bits)
0 = Request       Address
1 = Reply         Data

FIGURE 8.5 Message format for our simple network. Messages must have extra information beyond the data.

Adding a checksum field (or some other error detection code) to the message format meets the first responsibility. This redundant information is calculated when the message is first sent and checked upon receipt. The receiver then sends an acknowledgment if the message passes the test.

One way to meet the second responsibility is to have a timer record the time each message is sent and to presume the message is lost if the timer expires before an acknowledgment arrives. The message is then re-sent.

The software steps to send a message are as follows:

1. The application copies data to be sent into an operating system buffer.
2. The operating system calculates the checksum, includes it in the header or trailer of the message, and then starts the timer.
3. The operating system sends the data to the network interface hardware and tells the hardware to send the message.

Message reception is in just the reverse order:

3. The system copies the data from the network interface hardware into the operating system buffer.
2. The system calculates the checksum over the data. If the checksum matches the sender's checksum, the receiver sends an acknowledgment back to the sender. If not, it deletes the message, assuming that the sender will resend the message when the associated timer expires.
1. If the data pass the test, the system copies the data to the user's address space and signals the application to continue.

The sender must still react to the acknowledgment:

When the sender gets the acknowledgment, it releases the copy of the message from the system buffer.
If the sender gets the time-out instead of an acknowledgment, it resends the data and restarts the timer.

Here we assume that the operating system keeps the message in its buffer to support retransmission in case of failure. Figure 8.6 shows how the message format looks now.

Header (2 bits)             Payload (32 bits)    Trailer (4 bits)
00 = Request                Data                 Checksum
01 = Reply
10 = Acknowledge request
11 = Acknowledge reply

FIGURE 8.6 Message format for our simple network. Note that the checksum is in the trailer.

The sequence of steps that software follows to communicate is called a protocol and generally has symmetric but reversed steps between sending and receiving. Note that the example protocol above is for sending a single message. When an application does not require a response before sending the next message, the sender can overlap the time to send with the transmission delays and the time to receive.

A protocol must handle many more issues than reliability. For example, if two machines are from different manufacturers, they might order bytes differently within a word (see section 2.3 of Chapter 2). The software must reverse the order of bytes in each word as part of the delivery system. It must also guard against the possibility of duplicate messages if a delayed message were to become unstuck. It is often necessary to deliver the messages to the application in the order they are sent, and so sequence numbers may be added to the header to enable assembly. Finally, it must work when the receiver's FIFO becomes full, suggesting feedback to control the flow of messages from the sender (see section 8.4).

Now that we have covered the steps in sending and receiving a message, we can discuss performance. Figure 8.7 shows the many performance parameters of interconnection networks. This figure is critical to understanding network performance, so study it well! Note that the parameters in Figure 8.7 apply to the interconnect in many levels of the system: inside a chip, between chips on a board, between computers in a cluster, and so on. The units change, but the principles remain the same, as does the bandwidth that results.

FIGURE 8.7 Performance parameters of interconnection networks. Depending on whether it is a SAN, LAN, or WAN, the relative lengths of the time of flight and transmission may be quite different from those shown here. (Based on a presentation by Greg Papadopoulos of Sun Microsystems.) [Figure: a timeline. The sender row shows sender overhead followed by transmission time (bytes/bandwidth). The receiver row shows time of flight, then transmission time (bytes/bandwidth), then receiver overhead. Transport latency spans time of flight plus transmission time; total latency spans the whole timeline.]

These terms are often used loosely, leading to confusion, so we define them here precisely:

Bandwidth: We use this most widely used term to refer to the maximum rate at which the network can propagate information once the message enters the network. Unlike disks, bandwidth includes the headers and trailers as well as the payload, and the units are traditionally bits/second rather than bytes/second. The term bandwidth is also used to mean the measured speed of the medium or network bandwidth delivered to an application. Throughput is sometimes used for this latter term.

Time of flight: The time for the first bit of the message to arrive at the receiver, including the delays due to repeaters or other hardware in the network. Time of flight can be milliseconds for a WAN or nanoseconds for a SAN.

Transmission time: The time for the message to pass through the network, not including time of flight. One way to measure it is the difference in time between when the first bit of the message arrives at the receiver and when the last bit of the message arrives at the receiver. Note that by definition transmission time is equal to the size of the message divided by the bandwidth. This measure assumes there are no other messages to contend for the network.

Transport latency: The sum of time of flight and transmission time. Transport latency is the time that the message spends in the interconnection network. Stated alternatively, it is the time between when the first bit of the message is injected into the network and when the last bit of the message arrives at the receiver. It does not include the overhead of injecting the message into the network or pulling it out when it arrives.

Sender overhead: The time for the processor to inject the message into the network, including both hardware and software components. Note that the processor is busy for the entire time, hence the use of the term overhead. Once the processor is free, any subsequent delays are considered part of the transport latency. For pedagogic reasons, we assume overhead is not dependent on message size. (Typically, only very large messages have larger overhead.)

Receiver overhead: The time for the processor to pull the message from the interconnection network, including both hardware and software components. In general, the receiver overhead is larger than the sender overhead: for example, the receiver may pay the cost of an interrupt.

The total latency of a message can be expressed algebraically:

$$\text{Total latency} = \text{Sender overhead} + \text{Time of flight} + \frac{\text{Message size}}{\text{Bandwidth}} + \text{Receiver overhead}$$

Let's look at how the time of flight and overhead parameters change in importance as we go from SAN to LAN to WAN.

EXAMPLE Assume a network with a bandwidth of 1000 Mbits/second has a sending overhead of 80 microseconds and a receiving overhead of 100 microseconds. Assume two machines. One wants to send a 10,000-byte message to the other (including the header), and the message format allows 10,000 bytes in a single message. Let's compare SAN, LAN, and WAN by changing the distance between the machines. Calculate the total latency to send the message from one machine to another in a SAN assuming they are 10 meters apart. Next, perform the same calculation but assume the machines are now 500 meters apart, as in a LAN. Finally, assume they are 1000 kilometers apart, as in a WAN.

ANSWER The speed of light is 299,792.5 kilometers per second in a vacuum, and signals propagate at about 63% to 66% of the speed of light in a conductor. Since this is an estimate, in this chapter we'll round the speed of light to 300,000 kilometers per second, and assume we can achieve two-thirds of that in a conductor. Hence, we can estimate time of flight. Let's plug the parameters for the short distance of a SAN into the formula above:

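$$\text{Total latency} = \text{Sender overhead} + \text{Time of flight} + \frac{\text{Message size}}{\text{Bandwidth}} + \text{Receiver overhead}$$

$$= 80\ \mu\text{secs} + \frac{0.01\ \text{km}}{2/3 \times 300{,}000\ \text{km/sec}} + \frac{10{,}000\ \text{bytes} \times 8\ \text{bits/byte}}{1000\ \text{Mbits/sec}} + 100\ \mu\text{secs}$$

Converting all terms into microseconds (µsecs) leads to

$$\text{Total latency} = 80\ \mu\text{secs} + \frac{0.01 \times 10^{6}}{2/3 \times 300{,}000}\ \mu\text{secs} + \frac{10{,}000 \times 8}{1000}\ \mu\text{secs} + 100\ \mu\text{secs}$$

$$= 80\ \mu\text{secs} + 0.05\ \mu\text{secs} + 80\ \mu\text{secs} + 100\ \mu\text{secs} = 260.05\ \mu\text{secs} \approx 260\ \mu\text{secs}$$

Substituting an example LAN distance into the third equation yields

$$\text{Total latency} = 80\ \mu\text{secs} + \frac{0.5\ \text{km}}{2/3 \times 300{,}000\ \text{km/sec}} + \frac{10{,}000\ \text{bytes} \times 8\ \text{bits/byte}}{1000\ \text{Mbits/sec}} + 100\ \mu\text{secs}$$

$$= 80\ \mu\text{secs} + 2.5\ \mu\text{secs} + 80\ \mu\text{secs} + 100\ \mu\text{secs} = 262.5\ \mu\text{secs} \approx 262\ \mu\text{secs}$$

Substituting the WAN distance into the equation yields

$$\text{Total latency} = 80\ \mu\text{secs} + \frac{1000\ \text{km}}{2/3 \times 300{,}000\ \text{km/sec}} + \frac{10{,}000\ \text{bytes} \times 8\ \text{bits/byte}}{1000\ \text{Mbits/sec}} + 100\ \mu\text{secs}$$

$$= 80\ \mu\text{secs} + 5000\ \mu\text{secs} + 80\ \mu\text{secs} + 100\ \mu\text{secs} = 5260\ \mu\text{secs}$$

The increased fraction of the latency required by time of flight for long distances, as well as the greater likelihood of errors over long distances, are why wide area networks use more sophisticated and time-consuming protocols. Complexity increases from protocols used on a bus versus a LAN versus the Internet as we go from tens to hundreds to thousands of nodes.

Note that messages in LANs and WANs go through switches, which add to the latency; we neglected this above. Generally, switch latency is small compared to overhead in LANs or time of flight in WANs.

As mentioned above, when an application does not require a response before sending the next message, the sender can overlap the sending overhead with the transport latency and receiver overhead. Increased latency affects the structure of programs that try to hide this latency, requiring quite different solutions if the latency is 1, 100, or 10,000 microseconds.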
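The SAN, LAN, and WAN calculations above can be checked with a short script. This is a minimal sketch, not part of the original text: the function name `total_latency_us` and its parameter names are made up for illustration, and the 10,000-byte message size is the value implied by the 80-microsecond transmission term at 1000 Mbits/second.

```python
# Assumption from the worked example: signals travel at two-thirds of
# 300,000 km/s in a conductor.
SIGNAL_SPEED_KM_PER_S = 200_000.0


def total_latency_us(distance_km, message_bytes, bandwidth_mbits_per_s,
                     sender_overhead_us=80.0, receiver_overhead_us=100.0):
    """Total latency in microseconds for one message.

    Total latency = sender overhead + time of flight
                    + message size / bandwidth + receiver overhead
    """
    # Seconds of flight, converted to microseconds.
    time_of_flight_us = distance_km / SIGNAL_SPEED_KM_PER_S * 1e6
    # Bits divided by Mbits/second gives microseconds directly.
    transmission_time_us = message_bytes * 8 / bandwidth_mbits_per_s
    return (sender_overhead_us + time_of_flight_us
            + transmission_time_us + receiver_overhead_us)


# Distances from the example: 10 m (SAN), 500 m (LAN), 1000 km (WAN).
for name, distance_km in [("SAN", 0.01), ("LAN", 0.5), ("WAN", 1000.0)]:
    latency = total_latency_us(distance_km, 10_000, 1000.0)
    print(f"{name}: {latency:.2f} microseconds")
```

Varying `distance_km` while holding the overheads fixed shows how time of flight goes from negligible in a SAN to dominant in a WAN.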

Note that the example above shows that time of flight for SANs is so short relative to overhead that it can be ignored, yet in WANs, time of flight is so long that sender and receiver overheads can be ignored. Thus, we can simplify the performance equation by combining sender overhead, receiver overhead, and time of flight into a single term called Overhead:

$$\text{Total latency} \approx \text{Overhead} + \frac{\text{Message size}}{\text{Bandwidth}}$$

We can use this formula to calculate the effective bandwidth delivered by the network as message size varies:

$$\text{Effective bandwidth} = \frac{\text{Message size}}{\text{Total latency}}$$

Let's use this simpler equation to explore the impact of overhead and message size on effective bandwidth.

EXAMPLE Plot the effective bandwidth versus message size for overheads of 25 and 250 microseconds and for network bandwidths of 100, 1000, and 10,000 Mbits/second. Vary message size from 16 bytes to 4 megabytes. For what message sizes is the effective bandwidth virtually the same as the raw network bandwidth? If overhead is 250 microseconds, for what message sizes is the effective bandwidth always less than 100 Mbits/second?

ANSWER Figure 8.8 plots effective bandwidth versus message size using the simplified equation above. The notation oX,bwY means an overhead of X microseconds and a network bandwidth of Y Mbits/second. To amortize the cost of high overhead, message sizes must be four megabytes for effective bandwidth to be about the same as network bandwidth. Assuming the high overhead, message sizes of about 3K bytes or less will not break the 100 Mbits/second barrier no matter what the actual network bandwidth. Thus, we must lower overhead as well as increase network bandwidth unless messages are very large.

Hence, message size is important in getting the full benefit of fast networks. What is the natural size of messages? Figure 8.9 shows the size of Network File System (NFS) messages for 239 machines at Berkeley collected over a period of one week. One plot is cumulative in messages sent, and the other is cumulative in data bytes sent. The maximum NFS message size is just over 8 KB, yet 95% of the messages are less than 192 bytes long. Figure 8.10 shows similar results for Internet traffic, where the maximum transfer unit was 1500 bytes.

Again, 60% of the messages are less than 192 bytes long, and 1500-byte messages represented 50% of the bytes transferred. Many applications send far more small messages than large messages, since requests and acknowledgements are more frequent than data.

FIGURE 8.8 Bandwidth delivered versus message size for overheads of 25 and 250 microseconds and for network bandwidths of 100, 1000, and 10,000 Mbits/second. Note that with 250 microseconds of overhead and a network bandwidth of 1000 Mbits/second, only the 4-MB message size gets an effective bandwidth of 1000 Mbits/second. In fact, message sizes must be greater than 256 B for the effective bandwidth to exceed 10 Mbits/second. The notation oX,bwY means an overhead of X microseconds and a network bandwidth of Y Mbits/second. [Figure: effective bandwidth (Mbits/sec) versus message size (bytes) from 16 bytes to 4 MB, with one curve for each of o25,bw10000, o25,bw1000, o25,bw100, o250,bw10000, o250,bw1000, and o250,bw100.]

Summarizing this section, even this simple network has brought up the issues of protection, reliability, heterogeneity, software protocols, and a more sophisticated performance model. The next four sections address other key questions:

Which media are available to connect computers together?
What issues arise if you want to connect more than two computers?
What practical issues arise for commercial networks?
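The simplified performance model of this section (total latency is roughly overhead plus message size divided by bandwidth) is easy to tabulate. The sketch below is illustrative only: the function names are invented here, and message sizes step by factors of four from 16 bytes to 4 MB as an assumption about the plotted points.

```python
def effective_bandwidth_mbits(message_bytes, overhead_us, bandwidth_mbits_per_s):
    """Effective bandwidth = message size / (overhead + message size / bandwidth)."""
    message_bits = message_bytes * 8
    # Bits divided by Mbits/second gives microseconds.
    total_latency_us = overhead_us + message_bits / bandwidth_mbits_per_s
    # Bits per microsecond is the same unit as Mbits/second.
    return message_bits / total_latency_us


def bandwidth_ceiling_mbits(message_bytes, overhead_us):
    """Upper bound on effective bandwidth even if the network were infinitely fast."""
    return message_bytes * 8 / overhead_us


# Message sizes from 16 bytes to 4 MB in steps of 4x (an assumed sampling).
sizes = [16 * 4 ** k for k in range(10)]

for overhead_us in (25, 250):
    for bandwidth in (100, 1000, 10_000):
        row = [effective_bandwidth_mbits(size, overhead_us, bandwidth)
               for size in sizes]
        label = f"o{overhead_us},bw{bandwidth}"
        print(label, " ".join(f"{value:9.1f}" for value in row))
```

The `bandwidth_ceiling_mbits` helper makes the point about the 100 Mbits/second barrier concrete: with 250 microseconds of overhead, a 3 KB message cannot reach 100 Mbits/second regardless of the raw network bandwidth, because overhead alone caps the rate.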

FIGURE 8.9 Cumulative percentage of messages and data transferred as message size varies for NFS traffic. Each x-axis entry includes all bytes up to the next one; e.g., 32 represents 32 bytes to 63 bytes. More than half the bytes are sent in 8-KB messages, but 95% of the messages are less than 192 bytes. Figure 8.50 shows details of this measurement. Collected at the University of California at Berkeley. [Figure: cumulative percentage (0% to 100%) versus message size (bytes), with one curve for messages and one for data bytes.]

FIGURE 8.10 Cumulative percentage of messages and data transferred as message size varies for Internet traffic. About 40% of the messages were 40 bytes long, and 50% of the data transfer was in messages 1500 bytes long. The maximum transfer unit of most switches was 1500 bytes. Collected by Vern Paxson on MCI Internet traffic in 1998. [Figure: cumulative percentage (0% to 100%) versus message size, with one curve for messages and one for data bytes.]
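The send, receive, and acknowledge steps of section 8.2 can be sketched in a few lines using the message layout of Figure 8.6 (2-bit header, 32-bit payload, 4-bit checksum trailer). Everything beyond that layout is an assumption made for illustration: the text does not specify the checksum algorithm, so `checksum4` is a hypothetical nibble sum; acknowledgments carry a zero payload; and a missing acknowledgment stands in for the timer expiring.

```python
REQUEST, REPLY, ACK_REQUEST, ACK_REPLY = 0b00, 0b01, 0b10, 0b11


def checksum4(payload):
    """Hypothetical 4-bit checksum: sum of the eight payload nibbles, modulo 16."""
    return sum((payload >> shift) & 0xF for shift in range(0, 32, 4)) & 0xF


def pack(kind, payload):
    """Lay out header (2 bits), payload (32 bits), trailer (4 bits) in one integer."""
    payload &= 0xFFFFFFFF
    return (kind << 36) | (payload << 4) | checksum4(payload)


def unpack(message):
    """Return (kind, payload), or None if the trailer checksum does not match."""
    kind = (message >> 36) & 0b11
    payload = (message >> 4) & 0xFFFFFFFF
    if checksum4(payload) != (message & 0xF):
        return None
    return kind, payload


def receive(message):
    """Receiver: acknowledge a good message, silently drop a garbled one."""
    decoded = unpack(message)
    if decoded is None:
        return None  # no acknowledgment, so the sender will time out
    kind, _payload = decoded
    ack_kind = ACK_REQUEST if kind == REQUEST else ACK_REPLY
    return pack(ack_kind, 0)  # assumption: acknowledgments carry no data


def send_reliably(kind, payload, channel, max_tries=5):
    """Sender: keep a copy of the message and resend it until acknowledged."""
    message = pack(kind, payload)  # the buffered copy kept for retransmission
    for attempt in range(1, max_tries + 1):
        ack = receive(channel(message))
        if ack is not None:
            return attempt  # acknowledged: the buffered copy can be released
    raise TimeoutError("no acknowledgment received")


def flaky_channel():
    """A made-up channel that flips one payload bit on the first transmission only."""
    state = {"calls": 0}

    def channel(message):
        state["calls"] += 1
        return message ^ (1 << 5) if state["calls"] == 1 else message

    return channel


attempts = send_reliably(REQUEST, 0x1234, flaky_channel())
print("delivered after", attempts, "attempts")
```

A real protocol would also need sequence numbers to reject duplicates and flow control for a full receiver FIFO, as the section notes; this sketch covers only the checksum, acknowledgment, and retransmission steps.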

8.3 Interconnection Network Media

Just as there is a memory hierarchy, there is a hierarchy of media to interconnect computers that varies in cost, performance, and reliability. Network media have another figure of merit, the maximum distance between nodes. This section covers three popular examples, and Figure 8.11 illustrates them.

[Figure 8.11 drawing, three panels. Category 5 unshielded twisted pair ("Cat5"). Coaxial cable: plastic covering, braided outer conductor, insulator, copper core. Fiber optics: the transmitter (light source) is an LED or laser diode, the receiver is a photodiode, and light travels by total internal reflection through a silica core surrounded by cladding and a buffer.]

FIGURE 8.11 Three network media. (From a presentation by David Culler of U.C. Berkeley.)

The first medium is twisted pairs of copper wires. These are two insulated wires, each about 1 mm thick. They are twisted together to reduce electrical interference, since two parallel lines form an antenna but a twisted pair does not. As they can transfer a few megabits per second over several kilometers without amplification, twisted pairs were the mainstay of the telephone system. Telephone companies bundled together (and sheathed) many pairs coming into a building. Twisted pairs can also offer tens of megabits per second of bandwidth over shorter distances, making them plausible for LANs.

The desire to go at higher speeds with the less expensive copper led to improvements in the quality of unshielded twisted-pair copper cabling systems. The original telephone-line quality was called Level 1. Level 3 was good enough for 10 Mbits/second Ethernet. The desire for even greater bandwidth led to Level 5, or Category 5, which is sufficient for 100 Mbits/second Ethernet. By limiting the length to 100 meters, Cat5 wiring can be used for 1000 Mbits/second Ethernet links today. It uses the RJ-45 connector, which is similar to the connector found on telephone lines.

Coaxial cable was deployed by cable television companies to deliver a higher rate over a few kilometers. To offer high bandwidth and good noise immunity, insulating material surrounds a single stiff copper wire, and then a cylindrical conductor surrounds the insulator, often woven as a braided mesh. A 50-ohm baseband coaxial cable delivers 10 megabits per second over a kilometer.

Connecting to this heavily insulated medium is more challenging. The original technique was a T junction: the cable is cut in two and a connector is inserted that reconnects the cable and adds a third wire to a computer. A less invasive solution is a vampire tap: a hole of precise depth and width is first drilled into the cable, terminating in the copper core. A connector is then screwed in without having to cut the cable.

To keep up with the demands of bandwidth and distance, it became clear that the telephone company would need to find new media. The solution could be more expensive provided that it offered much higher bandwidth and that supplies were plentiful. The answer was to replace copper with glass and electrons with photons. Fiber optics transmits digital data as pulses of light. A fiber optic network has three components:

1. the transmission medium, a fiber optic cable;
2. the light source, an LED or laser diode;
3. the light detector, a photodiode.

First, cladding surrounds the glass fiber core to confine the light. A buffer then surrounds the cladding to protect the core and cladding.
Note that unlike twisted pairs or coax, fibers are one-way, or simplex, media. A two-way, or full duplex, connection between two nodes requires two fibers.

Since light bends or refracts at interfaces, it can slowly spread as it travels down the cable unless the diameter of the cable is limited to one wavelength of light; then it transfers in a straight line. Thus, fiber optic cables are of two forms:

1. Multimode fiber: It uses inexpensive LEDs as a light source. It is typically much larger than the wavelength of light: typically 62.5 microns in diameter vs. the 1.3-micron wavelength of infrared light. Since it is wider it has more dispersion problems, where some wave frequencies have different propagation velocities. The LEDs and dispersion limit it to up to a few hundred meters at 1000 Mbits/second or a few kilometers at 100 Mbits/second. It is older and less expensive than single-mode fiber.

2. Single-mode fiber: This single-wavelength fiber (typically 8 to 9 microns in diameter) requires more expensive laser diodes for light sources and currently transmits gigabits per second for hundreds of kilometers, making it the medium of choice for telephone companies.

The loss of signal strength as it passes through a medium, called attenuation, limits the length of the fiber. Although single-mode fiber is a better transmitter, it is much more difficult to attach connectors to single-mode fiber; it is less reliable and more expensive, and the cable itself has restrictions on the degree to which it can be bent. The cost, bandwidth, and distance of single-mode fiber are affected by the power of the light source, the sensitivity of the light detector, and the attenuation rate per kilometer of the fiber cable. Typically, glass fiber has better characteristics than the less expensive plastic fiber, and so is more widely used.

Connecting fiber optics to a computer is more challenging than connecting cable. The vampire tap solution of cable fails because it loses light. There are two forms of T-boxes:

1. Taps are fused onto the optical fiber. Each tap is passive, so a failure cuts off just a single computer.
2. In an active repeater, light is converted to electrical signals, sent to the computer, converted back to light, and then sent down the cable. If an active repeater fails, it blocks the network.

These taps and repeaters also reduce optical signal strength, reducing the useful distance of a single piece of fiber. In both cases, fiber optics has the additional cost of optical-to-electrical and electrical-to-optical conversion as part of the computer interface. Hence, the network interface cards for fiber optics are considerably more expensive than for Cat5 copper wire. In 2001, most switches for fiber involve such a conversion to allow switching, although expensive all-optical switches are beginning to be available.
To achieve even more bandwidth from a fiber, wavelength division multiplexing (WDM) sends different streams simultaneously on the same fiber using different wavelengths of light, and then demultiplexes the different wavelengths at the receiver. In 2001, WDM can deliver a combined 40 Gbits/second using about 8 wavelengths, with plans to go to 80 wavelengths and deliver 400 Gbits/second.

The product of the bandwidth and maximum distance forms a single figure of merit: gigabit-kilometers per second. According to Desurvire [1992], since 1975 optical fibers have increased transmission capacity tenfold every four years by this measure.

Let's compare media in an example.

EXAMPLE Suppose you have 25 magnetic tapes, each containing 40 GB. Assume that you have enough tape readers to keep any network busy. How long will it take to transmit the data over a distance of one kilometer? Assume the choices are Category 5 twisted pair wires at 100 Mbits/second, multimode fiber at 1000 Mbits/second, and single-mode fiber at 2500 Mbits/second. How do they compare to delivering the tapes by car?

ANSWER The amount of data is 1000 GB. The time for each medium is given below:

$$\text{Twisted pair} = \frac{1000 \times 1024 \times 8\ \text{Mb}}{100\ \text{Mb/sec}} = 81{,}920\ \text{secs} = 22.8\ \text{hours}$$

$$\text{Multimode fiber} = \frac{1000 \times 1024 \times 8\ \text{Mb}}{1000\ \text{Mb/sec}} = 8192\ \text{secs} = 2.3\ \text{hours}$$

$$\text{Single-mode fiber} = \frac{1000 \times 1024 \times 8\ \text{Mb}}{2500\ \text{Mb/sec}} = 3277\ \text{secs} = 0.9\ \text{hours}$$

$$\text{Car} = \text{Time to load car} + \text{Transport time} + \text{Time to unload car} = 300\ \text{secs} + \frac{1\ \text{km}}{30\ \text{kph}} + 300\ \text{secs} = 300\ \text{secs} + 120\ \text{secs} + 300\ \text{secs} = 720\ \text{secs} = 0.2\ \text{hours}$$

A car filled with high-density tapes is a high-bandwidth medium!

8.4 Connecting More Than Two Computers

Computer power increases by the square of the number of nodes on the network.
Robert Metcalfe ("Metcalfe's Law")

Thus far, we have discussed two computers communicating over private lines, but what makes interconnection networks interesting is the ability to connect hundreds of computers together. And what makes them more interesting also makes them more challenging to build.

Shared versus Switched Media

Certainly the simplest way to connect multiple computers is to have them share a single interconnection medium, just as I/O devices share a single I/O bus. The most popular LAN, Ethernet, originally was simply a bus shared by a hundred or so computers. Given that the medium is shared, there must be a mechanism to coordinate and arbitrate the use of the shared medium so that only one message is sent at a time.

If the network is small, it may be possible to have an additional central arbiter to give permission to send a message. (Of course, this leaves open the question of how the nodes talk to the arbiter.) Centralized arbitration is impractical for networks with a large number of nodes spread out over a kilometer, so we must distribute arbitration.

A first step towards arbitration is looking before you leap. A node first checks the network to avoid trying to send a message while another message is already on the network. If the interconnection is idle, the node tries to send. Looking first is not a guarantee of success, of course, as some other node may decide to send at the same instant. When two nodes send at the same time, it is called a collision. Let's assume that the network interface can detect any resulting collisions by listening to hear if the data were garbled by other data appearing on the line. Listening to avoid and detect collisions is called carrier sensing and collision detection. This is the second step of arbitration.

The problem is not solved. If every node on the network waited exactly the same amount of time, listened to be sure there was no traffic, and then tried to send again, we could still have synchronized nodes that would repeatedly bump heads. To avoid repeated head-on collisions, each node whose message was garbled waits (or backs off) a random time before resending. Note that randomization breaks the synchronization. Subsequent collisions result in exponentially increasing time between attempts to retransmit, so as not to tax the network. Although this approach is not guaranteed to be fair (some subsequent node may transmit while those that collided are waiting), it does control congestion on the shared medium. If the network does not have high demand from many nodes, this simple approach works well. Under high utilization, performance degrades since the medium is shared.

Another approach to arbitration is to pass a token between the nodes, with the token giving the node the right to use the network.
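The random, exponentially growing backoff described above can be sketched in a few lines. This is a minimal illustration, not a network interface implementation: the window that doubles with each collision and the cap of ten doublings are assumptions borrowed from classic Ethernet's truncated binary exponential backoff, and the function name is hypothetical.

```python
import random


def backoff_slots(collisions, rng, max_exponent=10):
    """Choose how many slot times to wait after `collisions` collisions in a row.

    Assumption: the wait is drawn uniformly from 0 .. 2**k - 1 slots, where k is
    the collision count capped at max_exponent.
    """
    exponent = min(collisions, max_exponent)
    return rng.randrange(2 ** exponent)


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the demonstration is repeatable
    for collisions in range(1, 6):
        waits = [backoff_slots(collisions, rng) for _ in range(2)]
        print(f"after {collisions} collision(s), two nodes wait {waits} slots")
```

Because each node draws its own random wait, two nodes that collided are unlikely to retry in the same slot, and the growing window spreads retries further apart as collisions repeat.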
If the shared media is connected in a ring, then the token can rotate through all the nodes on the ring.

Shared media have some of the same advantages and disadvantages as buses: they are inexpensive, but they have limited bandwidth. And like buses, they must have an arbitration scheme to solve conflicting demands.

The alternative to sharing the media is to have a dedicated line to a switch that in turn provides a dedicated line to all destinations. Figure 8.12 shows the potential bandwidth improvement of switches: aggregate bandwidth is many times that of a single shared medium. Switches allow communication directly from source to destination, without intermediate nodes to interfere with these signals. Such point-to-point communication is faster than a line shared between many nodes because there is no arbitration and the interface is simpler electrically. Of course, it does pay the added latency of going through the switch, trading off arbitration overhead for switching overhead.

Given the obvious advantages, why weren't switches always used? Earlier computers were much slower and so could share media. In addition, applications

such as the World Wide Web rely on the network much more than older applications. Finally, earlier switches would take several large boards, and be as large as a computer. In 2001, a single chip contains a full 64-by-64 switch, or at least a large slice of it. Moore's Law is making switches more attractive, and so technology trends favor switches today.

[Figure 8.12 drawing, two panels. Shared media (Ethernet): nodes attached to a single shared line. Switched media (ATM): nodes each attached to a central switch.]

FIGURE 8.12 Shared medium versus switch. Ethernet was originally a shared medium, but Ethernet switches are now available. All nodes on the shared media must share the 100 Mb/sec interconnection, but switches can support multiple 100 Mb/sec transfers simultaneously. Low-cost Ethernet switches are sometimes implemented with an internal bus with higher bandwidth, but high-speed switches have a crossbar interconnect.

Every node of a shared line will see every message, even if it is just to check to see whether or not the message is for that node, so this style of communication is sometimes called broadcast to contrast it with point-to-point. The shared medium makes it easy to broadcast a message to every node, and even to broadcast to subsets of nodes, called multicasting.

Switches allow multiple pairs of nodes to communicate simultaneously, giving these interconnections much higher aggregate bandwidth than the speed of a shared link to a node. Switches also allow the interconnection network to scale to a very large number of nodes. Switches are also called data switching exchanges, multistage interconnection networks, or even interface message processors (IMPs). Depending on the distance of the node to the switch and desired bandwidth, the network medium is either copper wire or optical fiber.

EXAMPLE Compare 16 nodes connected three ways: a single 100 Mb/sec shared media; a switch connected via Cat5, each segment running at 100 Mb/sec; and a switch connected via optical fibers, each running at 1000 Mb/sec. The shared media is 500 meters long, and the average length of each segment to a switch is 50 meters. Both switches can support the full bandwidth. Assume each switch adds 5 microseconds to the latency. Calculate the aggregate bandwidth and transport latency. Assume the average message size is 125 bytes, and ignore the overhead of sending or receiving a message and contention for the network.

ANSWER The aggregate bandwidth of each example is the simplest calculation: 100 Mb/sec for the shared media; 16 × 100, or 1600 Mb/sec for the switched twisted pairs; and 16 × 1000, or 16,000 Mb/sec for the switched optical fibers. The transport time is

$$\text{Transport time} = \text{Time of flight} + \frac{\text{Message size}}{\text{Bandwidth}}$$

For coax we just plug in the distance, bandwidth, and message size:

$$\text{Transport time}_{\text{shared}} = \frac{500/1000}{2/3 \times 300{,}000} \times 10^{6}\ \mu\text{secs} + \frac{125 \times 8}{100}\ \mu\text{secs} = 2.5\ \mu\text{secs} + 10\ \mu\text{secs} = 12.5\ \mu\text{secs}$$

For the switches, the distance is twice the average segment, since there is one segment from the sender to the switch and one from the switch to the receiver. We must also add the latency for the switch.

$$\text{Transport time}_{\text{switch}} = 2 \times \frac{50/1000}{2/3 \times 300{,}000} \times 10^{6}\ \mu\text{secs} + 5\ \mu\text{secs} + \frac{125 \times 8}{100}\ \mu\text{secs} = 0.5\ \mu\text{secs} + 5\ \mu\text{secs} + 10\ \mu\text{secs} = 15.5\ \mu\text{secs}$$

$$\text{Transport time}_{\text{fiber}} = 2 \times \frac{50/1000}{2/3 \times 300{,}000} \times 10^{6}\ \mu\text{secs} + 5\ \mu\text{secs} + \frac{125 \times 8}{1000}\ \mu\text{secs} = 0.5\ \mu\text{secs} + 5\ \mu\text{secs} + 1\ \mu\text{secs} = 6.5\ \mu\text{secs}$$

Although the bandwidth of the switch is many times that of the shared media, the latency for unloaded networks is comparable.
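The arithmetic in this example is easy to check mechanically. The sketch below recomputes the three transport times, assuming, as the example appears to, that signals propagate at two-thirds of the speed of light (300,000 km/sec) in both copper and fiber. The function and constant names are hypothetical.

```python
# Assumed propagation speed: 2/3 of the speed of light, in km/sec.
SIGNAL_SPEED_KM_PER_SEC = 2 / 3 * 300_000


def transport_time_us(distance_m, message_bytes, bandwidth_mbits, switch_delay_us=0.0):
    """Time of flight + switch delay + transmission time, all in microseconds."""
    flight_us = (distance_m / 1000) / SIGNAL_SPEED_KM_PER_SEC * 1e6
    # 1 Mbit/sec equals 1 bit per microsecond.
    transmit_us = message_bytes * 8 / bandwidth_mbits
    return flight_us + switch_delay_us + transmit_us


if __name__ == "__main__":
    shared = transport_time_us(500, 125, 100)
    # Two 50-meter segments: sender to switch, then switch to receiver.
    cat5_switch = transport_time_us(2 * 50, 125, 100, switch_delay_us=5)
    fiber_switch = transport_time_us(2 * 50, 125, 1000, switch_delay_us=5)
    print(f"shared: {shared:.1f}  Cat5 switch: {cat5_switch:.1f}  fiber switch: {fiber_switch:.1f}")
```

For such short messages and distances, the fixed switch delay and the transmission time dominate; time of flight contributes at most 2.5 microseconds.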

Switches allow communication to harvest the same rapid advance from silicon as have processors and main memory. Whereas the switches from telecommunications companies were once the size of mainframe computers, today we see single-chip switches. Just as single-chip processors led to processors replacing logic in a surprising number of places, single-chip switches are increasingly replacing buses and shared media networks.

Connection-Oriented versus Connectionless Communication

Before computers arrived on the scene, the telecommunications industry allowed communication around the world. An operator sets up a connection between a caller and a callee, and once the connection is established, a conversation can continue for hours. To share transmission lines over long distances, the telecommunications industry used switches to multiplex several conversations on the same lines. Since audio transmissions have relatively low bandwidth, the solution was to divide the bandwidth of the transmission line into a fixed number of frequencies, with each frequency assigned to a conversation. This technique is called frequency-division multiplexing.

Although a good match for voice, frequency-division multiplexing is inefficient for sending data. The problem is that the frequency channel is dedicated to the conversation whether or not there is anything being said. Hence, the long distance lines are busy based on the number of conversations, and not on the amount of information being sent at a particular time.

An alternative style of communication is called connectionless, where each package is routed to the destination by looking at its address. The postal system is a good example of connectionless communication.

Closely related to the idea of connection versus connectionless communication are the terms circuit switching and packet switching. Circuit switching is the traditional way to offer a connection-based service. A circuit is established from source to destination to carry the conversation, reserving bandwidth until the circuit is broken.
The alternative to circuit-switched transmission is to divide the information into packets, or frames, with each packet including the destination of the packet plus a portion of the information. Queuing theory in section 6.4 tells us that packets cannot use all of the bandwidth, but in general, this packet-switched approach allows more use of the bandwidth of the medium and is the traditional way to support connectionless communication.

EXAMPLE Let's compare a single 1000 Mbits/sec packet-switched network with ten 100 Mbits/sec packet-switched networks. Assume that the mean size of a packet is 250 bytes, the arrival rate is 250,000 packets per second, and the interarrival times are exponentially distributed. What is the mean response time for each alternative? What is the intuitive reason behind the difference?

ANSWER From section 6.4 in the prior chapter, we can use an M/M/1 queue to calculate the mean response time for the single fast network:

$$\text{Service rate} = \frac{\text{Bandwidth}}{\text{Message size}} = \frac{1000 \times 10^{6}}{250 \times 8} = 500{,}000\ \text{packets per second}$$

$$\text{Time}_{\text{server}} = \frac{1}{500{,}000} = 2\ \mu\text{secs}$$

$$\text{Utilization} = \frac{\text{Arrival rate}}{\text{Service rate}} = \frac{250{,}000}{500{,}000} = 0.5$$

$$\text{Time}_{\text{queue}} = \text{Time}_{\text{server}} \times \frac{\text{Server utilization}}{1 - \text{Server utilization}} = 2\ \mu\text{secs} \times \frac{0.5}{1 - 0.5} = 2\ \mu\text{secs}$$

$$\text{Mean response time} = \text{Time}_{\text{queue}} + \text{Time}_{\text{server}} = 2 + 2 = 4\ \mu\text{secs}$$

The 10 slow networks can be modeled by an M/M/m queue, and the appropriate formulas are found in section 6.7:

$$\text{Service rate} = \frac{100 \times 10^{6}}{250 \times 8} = 50{,}000\ \text{packets per second}$$

$$\text{Time}_{\text{server}} = \frac{1}{50{,}000} = 0.00002\ \text{secs} = 20\ \mu\text{secs}$$

$$\text{Utilization} = \frac{\text{Arrival rate}}{m \times \text{Service rate}} = \frac{250{,}000}{10 \times 50{,}000} = 0.5$$

$$\text{Time}_{\text{queue}} = \text{Time}_{\text{server}} \times \frac{\text{Server utilization}}{m \times (1 - \text{Server utilization})} = 20\ \mu\text{secs} \times \frac{0.5}{10 \times (1 - 0.5)} = 2\ \mu\text{secs}$$

$$\text{Mean response time} = \text{Time}_{\text{queue}} + \text{Time}_{\text{server}} = 2 + 20 = 22\ \mu\text{secs}$$

The intuition is clear from the results: the service time is much less for the faster network even though the queuing times are the same. This intuition is the argument for statistical multiplexing using packets; queuing times are not worse for a single faster network, and the latency for a single packet is much less. Stated alternatively, you get better latency when you use an unloaded fast network, and data traffic is bursty so it works.

Although connections traditionally align with circuit switching, providing the user with the appearance of a logical connection on top of a packet-switched network is certainly possible. TCP/IP, as we shall see in section 8.8, is a connection-oriented service that operates over packet-switched networks.
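The queuing comparison above can be replayed in code. The sketch below follows the formulas as they are used in the worked example; in particular, the multi-server queuing term is the simplified expression from the example, whereas an exact M/M/m analysis would use the Erlang C probability in place of the utilization in the numerator. The function names are hypothetical.

```python
def service_rate_per_sec(bandwidth_mbits, packet_bytes):
    """Packets per second that one network can serve."""
    return bandwidth_mbits * 1e6 / (packet_bytes * 8)


def mean_response_us(bandwidth_mbits, packet_bytes, arrivals_per_sec, servers=1):
    """Mean response time in microseconds, using the example's formulas.

    With servers=1 this is the M/M/1 result. With servers > 1 it uses the
    simplified queuing term from the example, not the exact Erlang C form.
    """
    rate = service_rate_per_sec(bandwidth_mbits, packet_bytes)
    time_server_us = 1e6 / rate
    utilization = arrivals_per_sec / (servers * rate)
    if utilization >= 1:
        raise ValueError("arrival rate exceeds total service capacity")
    time_queue_us = time_server_us * utilization / (servers * (1 - utilization))
    return time_queue_us + time_server_us


if __name__ == "__main__":
    fast = mean_response_us(1000, 250, 250_000)
    slow = mean_response_us(100, 250, 250_000, servers=10)
    print(f"one fast network: {fast:.0f} usecs, ten slow networks: {slow:.0f} usecs")
```

Both configurations see the same 2 microseconds of queuing; the gap between 4 and 22 microseconds comes entirely from the per-packet service time.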

Routing: Delivering Messages

Given that the path between nodes may be difficult to navigate depending upon the topology, the system must be able to route the message to the desired node. Shared media has a simple solution: The message is broadcast to all nodes that share the media, and each node looks at an address within the message to see whether the message is for that node. This routing also made it easy to broadcast one message to all nodes by reserving one address for everyone; broadcast is much harder to support in switch-based networks.

Switched media use three solutions for routing. In source-based routing, the message specifies the path to the destination. Since the network merely follows directions, it can be simpler. A second alternative is the virtual circuit, whereby a circuit is established between source and destination, and the message simply names the circuit to follow. ATM uses virtual circuits. The third approach is destination-based routing, where the message merely contains a destination address, and the switch must pick a path to deliver the message. IP uses destination routing. Hence, ATM switches are simpler conceptually; once a virtual circuit is established, packet switching is very fast. On the other hand, IP routers must decide how to route every packet they receive by doing a routing table lookup on every packet.

Destination-based routing may be deterministic and always follow the same path, or it may be adaptive, allowing the network to pick different routes to avoid failures or congestion. Closely related to adaptive routing is randomized routing, whereby the network will randomly pick between several equally good paths to spread the traffic throughout the network, thereby avoiding hot spots.

Switches in WANs route messages using a store-and-forward policy; each switch waits for the full message to arrive in the switch before it is sent on to the next switch. Generally store-and-forward can retry a message within the network in case of failure.
The alternative to store-and-forward, available in some SANs, is for the switch to examine the header, decide where to send the message, and then start transmitting it immediately without waiting for the rest of the message. It requires retransmission from the source on a failure within the network. This alternative is called either cut-through routing or wormhole routing. In wormhole routing, when the head of the message is blocked, the message stays strung out over the network, potentially blocking other messages. Cut-through routing lets the tail continue when the head is blocked, compressing the strung-out message into a single switch. Clearly, cut-through routing requires a buffer large enough to hold the largest packet, while wormhole routing needs only to buffer the piece of the packet sent between switches.

The advantage of both cut-through and wormhole routing over store-and-forward is that latency reduces from a function of the number of intermediate switches multiplied by the size of the packet to the time for the first part of the packet to negotiate the switches plus the transmission time.
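A simple latency model makes this advantage concrete. The sketch below assumes an unloaded network in which every switch adds a fixed delay: store-and-forward retransmits the whole packet on every link, while wormhole or cut-through routing pays the packet transmission time only once. The function names and the parameter values in the demonstration are illustrative assumptions.

```python
def store_and_forward_us(switches, switch_delay_us, transfer_us):
    """Each switch waits for the whole packet, so it is sent over switches + 1 links."""
    return switches * switch_delay_us + (switches + 1) * transfer_us


def wormhole_us(switches, switch_delay_us, transfer_us):
    """The header passes through each switch; the packet body is pipelined behind it."""
    return switches * switch_delay_us + transfer_us


if __name__ == "__main__":
    # Illustrative values: 7 switches, 0.25 usecs per switch, and a 20-byte
    # packet on a 20 MB/sec link (20 bytes / 20 bytes-per-usec = 1 usec).
    transfer_us = 20 / 20
    saf = store_and_forward_us(7, 0.25, transfer_us)
    worm = wormhole_us(7, 0.25, transfer_us)
    print(f"store-and-forward: {saf} usecs, wormhole: {worm} usecs, ratio: {saf / worm:.2f}")
```

The gap widens as packets get larger or paths get longer, because only the store-and-forward term multiplies the transfer time by the number of hops.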

EXAMPLE The CM-5 supercomputer used wormhole routing, with each switch buffer being just 4 bits per port. Compare the efficiency of store-and-forward versus wormhole routing for a 128-node machine using a CM-5 interconnection sending a 16-byte payload. Assume each switch takes 0.25 microseconds and that the transfer rate is 20 MBytes/sec.

ANSWER The CM-5 interconnection for 128 nodes is a hierarchy (see Figure 8.14 on page 595), and a message goes through seven intermediate switches. Each CM-5 packet has four bytes of header information, so the length of this packet is 20 bytes. The time to transfer 20 bytes over one CM-5 link is

$$\frac{20\ \text{B}}{20\ \text{MB/sec}} = 1\ \mu\text{sec}$$

Then the time for store-and-forward is

$$(\text{Switches} \times \text{Switch delay}) + ((\text{Switches} + 1) \times \text{Transfer time}) = (7 \times 0.25) + (8 \times 1) = 9.75\ \mu\text{secs}$$

while wormhole routing is

$$(\text{Switches} \times \text{Switch delay}) + \text{Transfer time} = (7 \times 0.25) + 1 = 2.75\ \mu\text{secs}$$

For this example, wormhole routing improves latency by more than a factor of three.

A final routing issue is the order in which packets arrive. Some networks require that packets arrive in the order sent. The alternative removes this restriction, requiring software to reassemble the packets in proper order.

Congestion Control

One advantage of a circuit-switched network is that once a circuit is established, it ensures there is sufficient bandwidth to deliver all the information sent along that circuit. Moreover, switches along a path can be requested to give specific quality of service guarantees. Thus, interconnection bandwidth is reserved as circuits are established rather than consumed as data are sent, and if the network is full, no more circuits can be established. You may have encountered this blockage when trying to place a long distance phone call on a popular holiday or to a television show, as the telephone system tells you that all circuits are busy and asks you to please call back at a later time.

Packet-switched networks generally do not reserve interconnect bandwidth in advance, so the interconnection network can become clogged with too many packets.
Just as with rush hour traffic, a traffic jam of packets increases packet latency. Packets take longer to arrive, and in extreme cases fewer packets per second are delivered by the interconnect, just as is the case for the poor rush-hour commuters. There is even the computer equivalent of gridlock: deadlock is

achieved when packets in the interconnect can make no forward progress no matter what sequence of events happens. Chapter 6 addresses how to avoid this ultimate congestion in the context of a multiprocessor.

Higher bandwidth and longer distance networks exacerbate these problems, as this example illustrates.

EXAMPLE Assume a 155 Mbits/sec network stretching from San Francisco to New York City. How many bytes will be in flight? What is the number if the network is upgraded to 1000 Mbits/sec?

ANSWER Use the prior assumptions and speed of light. The distance between San Francisco and New York City is 4120 km. Calculating time of flight:

$$\text{Time of flight} = \frac{4120\ \text{km}}{2/3 \times 300{,}000\ \text{km/sec}} = 0.0206\ \text{secs}$$

Let's assume the network delivers 50% of the peak bandwidth. The number of bytes in transit on a 155 Mbits/sec network is

$$\text{Bytes in transit} = \text{Delivered bandwidth} \times \text{Time of flight} = \frac{0.5 \times 155\ \text{Mbits/sec}}{8} \times 0.0206\ \text{secs} = 9.7\ \text{MB/sec} \times 0.0206\ \text{secs} = 0.200\ \text{MB}$$

At 1000 Mbits/sec the number is

$$\text{Bytes in transit} = \frac{0.5 \times 1000\ \text{Mbits/sec}}{8} \times 0.0206\ \text{secs} = 62.5\ \text{MB/sec} \times 0.0206\ \text{secs} = 1.288\ \text{MB}$$

More than a megabyte of messages will be a challenge to control and to store in the network.

The solution to congestion is to prevent new packets from entering the network until traffic reduces, just as metering lights guarding on-ramps control the rate of cars entering a freeway. There are three basic schemes used for congestion control in computer interconnection networks, each with its own weaknesses: packet discarding, flow control, and choke packets.

The simplest, and most callous, is packet discarding. If a packet arrives at a switch and there is no room in the buffer, the packet is discarded. This scheme relies on higher-level software that handles errors in transmission to resend lost packets. Internetworking protocols such as UDP discard packets.

The second scheme is to rely on flow control between pairs of receivers and senders. The idea is to use feedback to tell the sender when it is allowed to send the next packet. One version of feedback is via separate wires between adjacent senders and receivers that tell the sender to stop immediately when the receiver cannot accept another message. This backpressure feedback is rapidly sent back to the original sender over dedicated lines, causing all links between the two end points to be frozen until the receiver can make room for the next message. Backpressure flow control is common in supercomputer networks, SANs, and even some gigabit Ethernet switches, which send a fake collision signal to control flow.

A more sophisticated variation of feedback is for the ultimate destination to give the original sender a credit to send n packets before getting permission to send more. These are generically called credit-based flow control. A window is one version of credit-based flow control. The window's size determines the minimum frequency of communication from receiver to sender. The goal of the window is to send enough packets to overlap the latency of the interconnection with the overhead to send and receive a packet. The TCP protocol uses a window.

This brings us to a point of confusion on terminology in many papers and textbooks. Note that flow control describes just two nodes of the interconnection and not the total interconnection network between all end systems. Congestion control refers to schemes that reduce traffic when the collective traffic of all nodes is too large for the network to handle. Hence, flow control helps congestion control, but it is not a universal solution.

Choke packets are the basis of the third scheme. The observation is that you only want to limit traffic when the network is congested. The idea is for each switch to see how busy it is, entering a warning state when it passes a threshold. Each packet received by a switch in a warning state is sent back to the source via a choke packet that includes the intended destination.
The source is expected to reduce traffic to that destination by a fixed percentage. Since it likely will have already sent many packets along that path, it waits for all the packets in transit to be returned before taking choke packets seriously.

8.5 Network Topology

The number of topologies described in publications would be difficult to count, but the number that have been used commercially is just a handful, with designers of parallel supercomputers being the most visible and imaginative. They have used regular topologies to simplify packaging and scalability. The topologies of SANs, LANs, and WANs are more haphazard, having more to do with the challenges of long distance or simply the connection of equipment purchased over several years.

Topology matters less today than it did in the past. You don't want to rewrite your application for each new topology, but you would like the system to take advantage of locality that naturally occurs in programs.

Centralized Switch

[Figure 8.13 drawing, three panels: a. Crossbar; b. Omega network; c. Omega network switch box with inputs A and B and outputs C and D.]

FIGURE 8.13 Popular switch topologies for eight nodes. The links are unidirectional; data come in at the left and exit out the right link. The switch box in (c) can pass A to C and B to D or B to C and A to D. The crossbar uses n^2 switches, where n is the number of processors, while the Omega network uses n/2 log2 n of the large switch boxes, each of which is logically composed of four of the smaller switches. In this case the crossbar uses 64 switches versus 12 switch boxes or 48 switches in the Omega network. The crossbar, however, can simultaneously route any permutation of traffic pattern between processors. The Omega network cannot.

Figure 8.13 illustrates two of the popular switch organizations, with the path from node 0 to node 6 shown in gray in each topology. A fully connected, or crossbar, interconnection allows any node to communicate with any other node in one pass through the interconnection. Routing depends on the style of addressing. In source-based routing, the message includes a sequence of out-bound arcs to reach a destination. Once an outgoing arc is picked, that portion of the routing

sequence may be dropped from the packet. In destination-based routing, a table decides which port to take for a given address. Some networks will run programs in the switches ("spanning tree protocols") to generate the routing table on the fly once the network is connected. The Internet does something similar for routing.

An Omega interconnection uses less hardware than the crossbar interconnection (n/2 log2 n vs. n^2 switches), but contention is more likely to occur between messages. The amount of contention depends on the pattern of communication. This form of contention is called blocking. For example, in the Omega interconnection in Figure 8.13 a message from 1 to 7 blocks while waiting for a message from 0 to 6. Of course, if two nodes try to send to the same destination (both 0 and 1 send to 6) there will be contention for that link, even in the crossbar. Routing in an Omega net can use the same techniques as in a full crossbar.

A tree is the basis of another switch, with bandwidth added higher in the tree to match the requirements of common communications patterns. Figure 8.14 shows this topology, called a fat tree. Interconnections are normally drawn as graphs, with each arc of the graph representing a link of the communication interconnection, with nodes shown as black squares and switches shown as shaded circles. To double the number of nodes in a fat tree, we just add another level to the top of the tree. Notice that this also increases the bandwidth at the top of the tree, which is an advantage of a fat tree.

This figure shows that there are multiple paths between any two nodes in a fat tree. For example, between node 0 and node 8 there are four paths. Such redundancy can help with fault tolerance. In addition, if messages are randomly assigned to different paths, then this should spread the load throughout the switch and result in fewer congestion problems.

Thus far, the switch is separate from the processor and memory, and assumed to be located in a central location. Looking inside this switch, we see many smaller switches.
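Blocking in an Omega network, as discussed earlier in this section, can be explored with a small model. The sketch below assumes the textbook perfect-shuffle wiring with destination-tag routing (each stage shuffles the lines, then a 2-by-2 switch box picks its output from one bit of the destination, most significant bit first). That wiring may not match the exact drawing in Figure 8.13, so the specific pairs of messages that block each other here can differ from those in the figure. All function names are hypothetical.

```python
def omega_path(src, dst, bits=3):
    """Return the output line used after each stage when routing src -> dst.

    Assumed wiring: a perfect shuffle before every stage, then a 2x2 switch
    box whose output is selected by one destination bit, MSB first.
    """
    mask = (1 << bits) - 1
    position = src
    path = []
    for stage in range(bits):
        # Perfect shuffle: rotate the line number's bits left by one.
        position = ((position << 1) | (position >> (bits - 1))) & mask
        # The switch box replaces the low bit with the next destination bit.
        dst_bit = (dst >> (bits - 1 - stage)) & 1
        position = (position & ~1) | dst_bit
        path.append(position)
    return path


def blocks(first, second, bits=3):
    """True if two (src, dst) messages need the same output line at some stage."""
    a = omega_path(first[0], first[1], bits)
    b = omega_path(second[0], second[1], bits)
    return any(x == y for x, y in zip(a, b))


if __name__ == "__main__":
    print(omega_path(0, 6), omega_path(4, 7))
    print("0->6 and 4->7 block each other:", blocks((0, 6), (4, 7)))
```

In this model, the messages 0 to 6 and 4 to 7 have different destinations yet collide at the first stage, which is the kind of contention a crossbar avoids.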
The term multistage switch is sometimes used to refer to centralized units to reflect the multiple steps that a message may travel before it reaches a computer.

Distributed Switch

Instead of centralizing these small switching elements, an alternative is to place one small switch at every computer, yielding a distributed switching function.

Given a distributed switch, the question is how to connect the switches together. Figure 8.15 shows that a low-cost alternative to full interconnection is a network that connects a sequence of nodes together. This topology is called a ring. Since some nodes are not directly connected, some messages will have to hop along intermediate nodes until they arrive at the final destination. Unlike shared lines, a ring is capable of many simultaneous transfers: the first node can send to the second at the same time as the third node can send to the fourth, for example. Rings

FIGURE 8.14 A fat-tree topology for 16 nodes. The shaded circles are switches, and the squares at the bottom are processor-memory nodes. A simple 4-ary tree would only have the links at the front of the figure; that is, the tree with the root labeled 0,0. This three-dimensional view suggests the increase in bandwidth via extra links at each level over a simple tree, so bandwidth between each level of a fat tree is normally constant rather than reduced by a factor of four as in a 4-ary tree. Multiple paths and random routing give it the ability to route common patterns well, which ensures no single pattern from a broad class of communication patterns will do badly. In the CM-5 fat-tree implementation, the switches have four downward connections and two or four upward connections; in this figure the switches have two upward connections.

FIGURE 8.15 A ring network topology.

are not quite as good as this sounds because the average message must travel through $n/2$ switches, where $n$ is the number of nodes. To first order, a ring is like a pipelined bus: on the plus side are point-to-point links, and on the minus side are bus repeater delays.

One variation of rings used in local area networks is the token ring. To simplify arbitration, a single slot, or token, goes around the ring to determine which node is allowed to send a message. A node can send only when it gets the token. (A token is simply a special bit pattern.) In this section we evaluate the ring as a topology with more bandwidth than a bus, neglecting its advantages in arbitration.

A straightforward but expensive alternative to a ring is to have a dedicated communication link between every element of a distributed switch. The tremendous improvement in performance of fully connected switches is offset by the enormous increase in cost, typically going up with the square of the number of nodes. This cost inspires designers to invent new topologies that are between the cost of rings and the performance of fully connected networks. The evaluation of success depends in large part on the nature of the communication in the interconnection network. Real machines frequently add extra links to these simple topologies to improve performance and reliability. Figure 8.16 illustrates three popular topologies for high performance computers with distributed switches.

One popular measure for interconnections, in addition to the ones covered in section 8.2, is the bisection bandwidth. This measure is calculated by dividing the interconnect into two roughly equal parts, each with half the nodes. You then sum the bandwidth of the lines that cross that imaginary dividing line. For fully connected interconnections the bisection bandwidth is proportional to $(n/2)^2$, where $n$ is the number of nodes. For a bus, bisection bandwidth is just the speed of one link. Since some interconnections are not symmetric, the question arises as to where to draw the imaginary line when bisecting the interconnect.
Bisection bandwidth is a worst-case metric, so the answer is to choose the division that makes interconnection performance worst. Stated alternatively, calculate bisection bandwidths for all pairs of equal-sized parts, and pick the smallest. Figure 8.17 summarizes these different topologies using bisection bandwidth and the number of links for 64 nodes.

EXAMPLE A common communication pattern in scientific programs is to consider the nodes as elements of a two-dimensional array and then have communication to the nearest neighbor in a given direction. (This pattern is sometimes called NEWS communication, standing for north, east, west, and south, the directions on the compass.) Map an eight-by-eight array onto the 64 nodes in each topology, and assume every link of every interconnect is the same speed. How long does it take each node to send one message to its northern neighbor and one to its eastern neighbor? Ignore nodes that have no northern or eastern neighbors.
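The worst-case definition of bisection bandwidth given above ("calculate bisection bandwidths for all pairs of equal-sized parts, and pick the smallest") can be made concrete by brute force on small networks. The sketch below is illustrative only: it assumes every link has unit bandwidth, treats links as undirected, and uses eight nodes so that all equal splits can be enumerated. The helper names are hypothetical.

```python
from itertools import combinations


def bisection_width(n, links):
    """Smallest number of unit-bandwidth links cut by any split into two halves."""
    best = None
    for half in combinations(range(n), n // 2):
        if 0 not in half:
            continue  # each split shows up twice; keep the copy containing node 0
        side = set(half)
        cut = sum(1 for a, b in links if (a in side) != (b in side))
        best = cut if best is None else min(best, cut)
    return best


def ring(n):
    """Each node linked to its successor, wrapping around."""
    return [(i, (i + 1) % n) for i in range(n)]


def hypercube(dim):
    """Nodes are linked when their labels differ in exactly one bit."""
    n = 1 << dim
    return [(i, i ^ (1 << b)) for i in range(n) for b in range(dim)
            if i < (i ^ (1 << b))]


def fully_connected(n):
    """A dedicated link between every pair of nodes."""
    return list(combinations(range(n), 2))


if __name__ == "__main__":
    n = 8
    print("ring           ", bisection_width(n, ring(n)))
    print("3-cube         ", bisection_width(n, hypercube(3)))
    print("fully connected", bisection_width(n, fully_connected(n)))
```

For the fully connected case the result equals $(n/2)^2$, matching the proportionality stated in the text. Exhaustive enumeration is only practical for small $n$; for 64 nodes one would rely on the symmetry of each topology instead.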

FIGURE 8.16 Network topologies that have appeared in commercial supercomputers: (a) 2D grid or mesh of 16 nodes, (b) 2D torus of 16 nodes, (c) hypercube of 16 nodes ($16 = 2^4$, so $n = 4$). The shaded circles represent switches, and the black squares represent nodes. Even though a switch has many links, generally only one goes to the node. Frequently these basic topologies are supplemented with extra arcs to improve performance and reliability. For example, connecting the switches in the left and right columns of the 2D grid using the unused ports on each switch forms a 2D torus. The Boolean hypercube topology is an $n$-dimensional interconnect for $2^n$ nodes, requiring $n$ ports per switch (plus one for the processor), and thus $n$ nearest neighbor nodes.

ANSWER In this case, we want to send $2 \times (64 - 8)$, or 112, messages. Here are the cases, again in increasing order of difficulty of explanation:

Bus: The placement of the eight-by-eight array makes no difference for the bus, since all nodes are equally distant. The 112 transfers are done sequentially, taking 112 time units.

Fully connected: Again the nodes are equally distant; all transfers are done in parallel, taking one time unit.

Ring: Here the nodes are at differing distances. Assume the first row

of the array is placed on nodes 0 to 7, the second row on nodes 8 to 15, and so on. It takes just one time unit to send to the eastern neighbor, for this is a send from node $n$ to node $n + 1$. In this scheme the northern neighbor is exactly eight nodes away, so it takes eight time units for each node to send to its northern neighbor. The ring total is nine time units.

2D torus: There are eight rows and eight columns in our grid of 64 nodes, which is a perfect match to the NEWS communication. It takes just two time units to send to the northern and eastern neighbors.

6-cube: It is possible to place the array so that it will take just two time units for this communication pattern, as in the case of the torus.

FIGURE 8.17 Relative cost and performance of several interconnects for 64 nodes. [Table: columns are Bus, Ring, 2D torus, 6-cube, and Fully connected; the performance row is Bisection bandwidth, and the cost rows are Ports per switch and Total number of lines. The only entry that survives in this copy is NA.] The bus is the standard reference at unit cost, and of course there can be more than one data line along each link between nodes. Note that any network topology that scales the bisection bandwidth linearly must scale the number of interconnection lines faster than linearly. Figure 8.13a on page 593 is an example of a fully connected network.

This simple analysis of interconnection networks in this section ignores several important practical considerations in the construction of an interconnection network. First, these three-dimensional drawings must be mapped onto chips, boards, and cabinets that are essentially two-dimensional media, often tree-like. For example, due to the fixed height of cabinets, an Intel Paragon used a rectangular grid rather than the ideal layout. Another consideration is the internal speed of the switch: if it is fixed, then more links per switch means lower bandwidth per link, potentially affecting the desirability of different topologies. Yet another consideration is that the latency through a switch depends on the complexity of the routing pattern, which in turn depends on the topology.
Finally, the bandwidth from the processor is often the limiting factor: if there is only one port in and out of the processor, then it can only send or receive one message per time unit regardless of the technology.

Topologies that appear elegant when sketched on the blackboard may look awkward when constructed from chips, cables, boards, and boxes. The bottom

line is that quality of implementation matters more than topology. To put these topologies in perspective, Figure 8.18 lists those used in commercial high performance computers.

FIGURE 8.18 Characteristics of interconnections of some commercial supercomputers. [Table: columns are Institution, Name, Number of nodes, Basic topology, Data bits/link, Network clock rate, Peak BW/link (MB/sec), Bisection (MB/sec), and Year. Rows are Thinking Machines CM-2, Intel Delta, Thinking Machines CM-5, Intel Paragon, IBM SP-2, Cray Research T3E, Intel ASCI Red, IBM ASCI Blue Pacific, SGI ASCI Blue Mountain, IBM ASCI Blue Horizon, IBM SP, and IBM ASCI White. Legible entries: CM-2, 1024 to 4096 nodes, 12-cube, 1 data bit/link, 7 MHz; CM-5, 32 to 2048 nodes, multistage fat tree, 4 data bits/link, 40 MHz, 20 MB/sec per link; SP-2, 2 to 512 nodes, multistage fat tree, 8 data bits/link, 40 MHz, 40 MB/sec per link; ASCI Red, 4536 nodes (x 2 CPUs); ASCI Blue Pacific, 1336 nodes (x 4 CPUs); ASCI Blue Mountain, 1464 nodes (x 2 CPUs); ASCI Blue Horizon, 144 nodes (x 8 CPUs); SP, 1 to 512 nodes (x 2 to 16 CPUs); ASCI White, 484 nodes (x 16 CPUs). The remaining topologies listed are grid, torus, fat hypercube, and multistage Omega.] The bisection bandwidth is for the largest machine. The 2D grid of the Intel Delta is 16 rows by 35 columns and the ASCI Red is 38 rows by 32 columns. The fat-tree topology of the CM-5 is restricted in the lower two levels, hence the lower bandwidth in the bisection. Note that the Cray T3D has two processors per node and the Intel Paragon has from two to four processors per node.

Once again the issues discussed in this section apply at many levels, from inside a chip to a country-sized WAN. The redundancy of a topology matters so that the network can survive despite failures. This is true within a switch as well, so that a single chip failure need not lead to switch failure. It also must be true for a WAN, so that a single backhoe cannot take down the network of a country. The switch then depends on the implementation technology and the demands of the application: it is a multistage network whose topology can be anything from a bus to an Omega network.
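The NEWS example worked earlier in this section can be checked numerically. The sketch below recomputes the message count and the ring total under the row-major placement used in the answer, and then tests one possible placement for the 6-cube. The Gray-code labeling is an assumption introduced here for illustration (the text only states that a suitable placement exists), and the helper names are hypothetical.

```python
SIDE = 8  # eight-by-eight array mapped onto 64 nodes


def gray(i: int) -> int:
    """Reflected binary Gray code: consecutive integers differ in one bit."""
    return i ^ (i >> 1)


def hamming(a: int, b: int) -> int:
    """Number of hypercube hops between two node labels."""
    return bin(a ^ b).count("1")


def cube_node(row: int, col: int) -> int:
    """Assumed placement: Gray-coded row in the high 3 bits, column in the low 3."""
    return (gray(row) << 3) | gray(col)


# Every node sends north and east, except the 8 nodes on the northern edge
# and the 8 nodes on the eastern edge.
messages = 2 * (SIDE * SIDE - SIDE)

# Ring with row-major placement (node = 8*row + col): the eastern neighbor is
# one hop away and the northern neighbor is eight hops away.
ring_time = 1 + SIDE

# 6-cube: worst-case hop count to a northern or eastern neighbor.
worst_cube_hops = max(
    hamming(cube_node(r, c), cube_node(r2, c2))
    for r in range(SIDE)
    for c in range(SIDE)
    for r2, c2 in ((r - 1, c), (r, c + 1))
    if 0 <= r2 < SIDE and 0 <= c2 < SIDE
)

if __name__ == "__main__":
    print("messages:", messages)
    print("bus time:", messages, "fully connected time:", 1)
    print("ring time:", ring_time)
    print("6-cube worst hops per send:", worst_cube_hops)
```

Because every array neighbor lands on a directly connected cube node under this labeling, the two sends take two time units, in line with the answer for the 6-cube and the torus.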

8.6 Practical Issues for Commercial Interconnection Networks

There are practical issues in addition to the technical issues described so far that are important considerations for some interconnection networks: connectivity, standardization, and fault tolerance.

Connectivity

The number of machines that communicate affects the complexity of the network and its protocols. The protocols must target the largest size of the network, and handle the types of anomalous events that occur. Hundreds of machines communicating are a much easier problem than millions.

Connecting the Network to the Computer

Where the network attaches to the computer affects both the network interface hardware and software. Questions include whether to use the memory bus or the I/O bus, whether to use polling or interrupts, and how to avoid invoking the operating system. The network interface is often the network bottleneck.

Computers have a hierarchy of buses with different cost/performance. For example, a personal computer in 2001 has a memory bus, a PCI bus for fast I/O devices, and a USB bus for slow I/O devices. I/O buses follow open standards and have less stringent electrical requirements. Memory buses, on the other hand, provide higher bandwidth and lower latency than I/O buses. Where to connect the network to the machine depends on the performance goals and whether you hope to buy a standard network interface card or are willing to design or buy one that only works with the memory bus on your model of computer. A few SANs plug into the memory bus, but most SANs and all LANs and WANs plug into the I/O bus.

The location of the network connection significantly affects the software interface to the network as well as the hardware. As mentioned in section 6.6, one key is whether the interface is coherent with the processor's caches: the sender may have to flush the cache before each send, and the receiver may have to flush its cache before each receive to prevent the stale data problem. Such flushes increase send and receive overhead.
A memory bus is more likely to be cache-coherent than an I/O bus and therefore more likely to avoid these extra cache flushes.

A related question of where to connect to the computer is how to connect to the software: do you use programmed I/O or direct memory access (DMA) to send a message? (See section 6.6.) In general, DMA is the best way to send large

messages. Whether to use DMA to send small messages depends on the efficiency of the interface to the DMA. The DMA interface is usually memory-mapped, and so each interaction is typically at the speed of main memory rather than of a cache access. If DMA setup takes many accesses, each running at uncached memory speeds, then the sender overhead may be so high that it is faster to simply send the data directly to the interface.

Standardization: Cross-Company Interoperability

Standards are useful in many places in computer design, but with interconnection networks they are often critical. Advantages of successful standards include low cost and stability. The customer has many vendors to choose from, which both keeps price close to cost due to competition and makes the viability of the interconnection independent of the stability of a single company. Components designed for a standard interconnection may also have a larger market, and this higher volume can lower the vendor's costs, further benefiting the customer. Finally, a standard allows many companies to build products with interfaces to the standard, so the customer does not have to wait for a single company to develop interfaces to all the products the customer might be interested in.

One drawback of standards is the time it takes for committees to agree on the definition of standards, which is a problem when technology is changing quickly. Another problem is when to standardize: on one hand, designers would like to have a standard before anything is built; on the other, it would be better if something is built before standardization to avoid legislating useless features or omitting important ones. When done too early, standardization is often done entirely by committee, which is like asking all of the chefs in France to prepare a single dish of food; masterpieces are rarely served. Standards can also suppress innovation at that level, since the standard fixes interfaces.

LANs and WANs use standards and interoperate effectively.
WANs involve many types of companies and must connect to many brands of computers, so it is difficult to imagine a proprietary WAN ever being successful. The ubiquitous nature of the Ethernet shows the popularity of standards for LANs as well as WANs, and it seems unlikely that many customers would tie the viability of their LAN to the stability of a single company. Alas, some SANs are standardized yet switches from different companies do not interoperate, while others interoperate as well as LANs and WANs do.

Message Failure Tolerance

Although some hardware designers try to build fault-free networks, in practice it is only a question of the rate of faults, not whether you can prevent them. Thus, the communication system must have mechanisms for retransmission of a message in case of failure. Often it is handled in higher layers of the software protocol at the end points, requiring retransmission at the source. Given the long time of flight for WANs, often they can retransmit from hop to hop rather than relying only on retransmission from the source.

Node Failure Tolerance

The second practical issue refers to whether or not the interconnection relies on all the nodes being operational in order for the interconnection to work properly. Since software failures are generally much more frequent than hardware failures, the question is whether a software crash on a single node can prevent the rest of the nodes from communicating.

Clearly, WANs would be useless if they demanded that thousands of computers spread across a continent be continuously available, and so they all tolerate the failures of individual nodes. LANs connect dozens to hundreds of computers together, and again it would be impractical to require that no computer ever fail. All successful LANs normally survive node failures.

Although most SANs have the ability to work around failed nodes and switches, it is not clear that all communication layer software supports this feature. Typically, low latency schemes sacrifice fault tolerance.

EXAMPLE Figure 8.19 shows the number of failures of 58 desktop computers on a local area network for a period of just over one year. Suppose that one local area network is based on a network that requires all machines to be operational for the interconnection network to send data; if a node crashes, it cannot accept messages, so the interconnection becomes choked with data waiting to be delivered. An alternative is the traditional local area network, which can operate in the presence of node failures; the interconnection simply discards messages for a node that decides not to accept them. Assuming that you need to have both your workstation and the connecting LAN to get your work done, how much greater are your chances of being prevented from getting your work done using the failure-intolerant LAN versus traditional LANs? Assume the down time for a crash is less than 30 minutes. Calculate using the one-hour intervals from this figure.
ANSWER Assuming the numbers for Figure 8.19, the percentage of hours that you can't get your work done using the failure-intolerant network is

$$\frac{\text{Intervals with failures}}{\text{Total intervals}} = \frac{\text{Total intervals} - \text{Intervals with no failures}}{\text{Total intervals}} = 4.1\%$$

The percentage of hours that you can't get your work done using the traditional network is just the time your workstation has crashed. If these failures are equally distributed among workstations, the percentage is

$$\frac{\text{Failures}/\text{Machines}}{\text{Total intervals}} = \frac{654/58}{\text{Total intervals}} = 0.13\%$$

Hence, you are more than 30 times more likely to be prevented from getting your work done with the failure-intolerant LAN than with the traditional LAN, according to the failure statistics in Figure 8.19. Stated alternatively, the person responsible for maintaining the LAN would receive a thirtyfold increase in phone calls from irate users!

FIGURE 8.19 Measurement of reboots of 58 DECstation 5000s running Ultrix over a 373-day period. [Table: columns are Failed machines per time interval; One-hour intervals with number of failed machines in first column; Total failures per one-hour interval; One-day intervals with number of failed machines in first column; Total failures per one-day interval; the last row is Total. The numeric entries do not survive in this copy.] These reboots are distributed into time intervals of one hour and one day. The first column sorts the intervals according to the number of machines that failed in that interval. The next two columns concern one-hour intervals, and the last two columns concern one-day intervals. The second and fourth columns show the number of intervals for each number of failed machines. The third and fifth columns are just the product of the number of failed machines and the number of intervals. For example, there were 50 occurrences of one-hour intervals with two failed machines, for a total of 100 failed machines, and there were 35 days with two failed machines, for a total of 70 failures. As we would expect, the number of failures per interval changes with the size of the interval. For example, the day with 31 failures might include one hour with 11 failures and one hour with 20 failures. The last row shows the total number of each column: the number of failures doesn't agree because multiple reboots of the same machine in the same interval do not result in separate entries. (Randy Wang of U.C. Berkeley collected these data.)
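The comparison in this example can be reproduced approximately with a short calculation. The sketch below makes one labeled assumption: the total number of one-hour intervals is taken as 373 days times 24 hours, derived from the measurement period in the Figure 8.19 caption rather than read from the table. The 4.1% figure for the failure-intolerant network is taken directly from the answer, since the interval counts behind it are not reproduced here.

```python
MACHINES = 58
TOTAL_FAILURES = 654           # numerator used in the worked answer
HOURS = 373 * 24               # assumption: 373-day period counted in whole hours

# Fraction of hours lost on the failure-intolerant LAN, as quoted in the answer.
intolerant_fraction = 0.041

# Traditional LAN: you lose an hour only when your own workstation crashes,
# assuming failures are spread evenly over the machines.
traditional_fraction = (TOTAL_FAILURES / MACHINES) / HOURS

ratio = intolerant_fraction / traditional_fraction

if __name__ == "__main__":
    print(f"traditional LAN: {traditional_fraction:.2%} of hours lost")
    print(f"failure-intolerant LAN is roughly {ratio:.0f} times worse")
```

Under that assumption the traditional-LAN figure rounds to the 0.13% given in the answer, and the ratio comes out a little above 30, consistent with the "more than 30 times" conclusion.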

One practical issue ties to node failure tolerance: If the interconnection can survive a failure, can it also continue operation while a new node is added to the interconnection? If not, each addition of a new node disables the interconnection network. Disabling is impractical for both WANs and LANs.

Finally, we have been discussing the ability of the network to operate in the presence of failed nodes. Clearly as important to the happiness of the network administrator is the reliability of the network media and switches themselves, for their failure is certain to frustrate much of the user community.

8.7 Examples of Interconnection Networks

To further understand these issues, we look at ten design decisions on the topics we covered so far using examples from LAN, SAN, and WAN: What is the target bandwidth? What is the message format? Which media are used? Is the network shared or switched? Is it connection-oriented or connectionless? Does it use store-and-forward or cut-through routing? Is routing source-based, destination-based, or virtual-circuit based? What is used for congestion control? What topologies are supported? Does it follow a standard?

Ethernet: The Local Area Network

The first example is the Ethernet. It has been extraordinarily successful, with the 10 Mbits/second standard proposed in 1978 used practically everywhere. In 2001, the 100 Mbits/second standard proposed in 1994 is closing in popularity. Many classes of computers include Ethernet as a standard interface. This packet-switched network is connectionless, and it routes using the destination address. Figure 8.20 shows the packet formats for Ethernet, as well as the other two examples. Ethernet is codified as IEEE standard 802.3. Designed originally for co-axial cable, today Ethernets are primarily Cat5 copper wire, with optical fiber reserved for longer distances and higher bandwidths. There is even a wireless version, which is testimony to its ubiquity.
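The Ethernet packet format referenced above (Figure 8.20) carries 0 to 1500 bytes of data plus a pad of 0 to 46 bytes, so that the smallest packet is 64 bytes including the header. The following sketch shows how the pad length could be derived. The individual field widths (6-byte destination, 6-byte source, 2-byte length, 4-byte checksum) and the choice to exclude the preamble from the 64-byte minimum are assumptions taken from common Ethernet practice rather than from the figure, and the function names are hypothetical.

```python
# Assumed field widths in bytes: destination, source, length, checksum.
HEADER_AND_CHECKSUM = 6 + 6 + 2 + 4
MIN_PACKET = 64     # smallest packet, including the header
MAX_DATA = 1500     # largest data field


def pad_length(data_len: int) -> int:
    """Bytes of padding needed so the packet reaches the 64-byte minimum."""
    if not 0 <= data_len <= MAX_DATA:
        raise ValueError("data field must hold 0 to 1500 bytes")
    return max(0, MIN_PACKET - HEADER_AND_CHECKSUM - data_len)


def packet_length(data_len: int) -> int:
    """Total packet size in bytes, excluding the preamble."""
    return HEADER_AND_CHECKSUM + data_len + pad_length(data_len)


if __name__ == "__main__":
    for size in (0, 1, 46, 100, 1500):
        print(size, pad_length(size), packet_length(size))
```

With these assumed widths an empty data field needs the full 46 bytes of padding, which matches the 0 to 46 range shown for the pad field.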
Computers became thousands of times faster than they were in 1978, yet the shared interconnection was no faster for almost 20 years. Hence, past engineers

FIGURE 8.20 Packet format for Infiniband, Ethernet, and ATM. [Diagram: each format is drawn 32 bits wide. Infiniband carries a data field of 0 to 4096 bytes, Ethernet a data field of 0 to 1500 bytes followed by a pad of 0 to 46 bytes, and ATM a fixed data field of 48 bytes; the checksum fields are 32 bits.] ATM calls their messages cells instead of packets, so the proper name is ATM cell format. The width of each drawing is 32 bits. All three formats have destination addressing fields, encoded differently for each situation. All three also have a checksum field to catch transmission errors, although the ATM checksum field is calculated only over the header; ATM relies on higher-level protocols to catch errors in the data. Both Infiniband and Ethernet have a length field, since the packets hold a variable amount of data, with the former counted in 32-bit words and the latter in bytes. Infiniband and ATM headers have a type field (T) that gives the type of packet. The remaining Ethernet fields are a preamble to allow the receiver to recover the clock from the self-clocking code used on the Ethernet, the source address, and a pad field to make sure the smallest packet is 64 bytes (including the header). Infiniband includes a version field for protocol version, a sequence number to allow in-order delivery, a field to select the destination queue, and a partition key field. Infiniband has many more small fields not shown and many other packet formats; above is a simplified view. ATM's short, fixed-size packet is a good match to the real-time demands of digital voice.

invented temporary solutions until a faster Ethernet was available. One solution was to use multiple Ethernets to connect machines, and to connect these smaller Ethernets with devices that can take traffic from one Ethernet and pass it on to another as needed. These devices allow individual Ethernets to operate in parallel, thereby increasing the aggregate interconnection bandwidth of a collection of computers. In effect these devices provide similar functionality to the switches described above for point-to-point networks.

FIGURE 8.21 The potential increased bandwidth of using many Ethernets and bridges. [Diagram: a single Ethernet carries one packet at a time among its nodes; multiple Ethernets joined by bridges carry multiple packets at a time.]

Figure 8.21 shows the potential parallelism. Depending on how they pass traffic and what kinds of interconnections they can put together, these devices have different names:

Bridges: These devices connect LANs together, passing traffic from one side to another depending on the addresses in the packet. Bridges operate at the Ethernet protocol level and are usually simpler and cheaper than routers, discussed next. Using the notation of the OSI model described in the next section (see Figure 8.25 on page 612), bridges operate at layer 2, the data link layer.

Routers or gateways: These devices connect LANs to WANs or WANs to WANs, and resolve incompatible addressing. Generally slower than bridges, they operate at OSI layer 3, the network layer. Routers divide the interconnect

into separate smaller subnets, which simplifies manageability and improves security.

The final network devices are hubs, but they merely extend multiple segments into a single LAN. Thus, hubs do not help with performance, as only one message can transmit at a time. Hubs operate at OSI layer 1, the physical layer. Since these devices were not planned as part of the Ethernet standard, their ad hoc nature has added to the difficulty and cost of maintaining LANs.

In 2001, Ethernet link speed is available at 10, 100, and 1000 Mbits/second, with a still higher speed likely to become available starting in 2002. Although 10 and 100 Mbits/sec can share the media with multiple devices, 1000 Mbits/second and above rely on point-to-point links and switches. Ethernet switches normally use cut-through routing.

Due to its age, Ethernet has no real flow control. It originally used carrier sensing with exponential back-off (see page 584) to arbitrate for the shared media. Some switches try to use that interface to retrofit their version of flow control, but flow control is not part of an Ethernet standard.

Storage Area Network: Infiniband

A SAN that tries to optimize based on shorter distances is Infiniband. This new standard has clock rates of 2.5 GHz and can transmit data at a peak speed of 2000 Mbits/second per link. These point-to-point links can be bundled together in groups of 4 to 12 to give 4 to 12 times the bandwidth per link. Like Ethernet, it is a packet-switched, connectionless network. It also relies only on switches, as does gigabit Ethernet, and also uses cut-through routing and destination-based addressing. The distances are much shorter than Ethernet, with category 5 wire limited to 17 meters and optical fiber limited to 100 meters. It uses backpressure for flow control (see page 592). When going to storage, it relies on the SCSI command set. Although it is not a traditional standard, a trade organization of cooperating companies is responsible for Infiniband.
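The Infiniband figures quoted above (a 2.5 GHz clock, 2000 Mbits/second per link, and bundles of 4 or 12 links) fit together under one assumption that the text does not state: that each link uses an 8b/10b line code, so only 8 of every 10 transmitted bits carry data. The sketch below treats that encoding as an assumption and uses integer arithmetic to avoid rounding noise; the names are hypothetical.

```python
CLOCK_MHZ = 2500                 # 2.5 GHz signaling rate per link
DATA_BITS, CODE_BITS = 8, 10     # assumption: 8b/10b line code


def peak_mbits_per_second(links: int) -> int:
    """Peak data rate in Mbits/second for a bundle of point-to-point links."""
    per_link = CLOCK_MHZ * DATA_BITS // CODE_BITS
    return per_link * links


if __name__ == "__main__":
    for width in (1, 4, 12):
        print(f"{width:2d} link(s): {peak_mbits_per_second(width)} Mbits/second")
```

A single link comes out at the 2000 Mbits/second stated in the text, and the bundles scale that figure by 4 and 12.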
Given the similarities, why does one need a separate standard for a storage area network versus a local area network? The storage community believes a SAN has a different emphasis from a LAN. First, protocol overhead is much lower for a SAN. A gigabit per second LAN can fully occupy a 0.8 to 1.0 GHz CPU when running TCP/IP (see page 653). The Infiniband protocol, on the other hand, places a very light load on the host computer. The reason is a controller on the Infiniband network interface card that offloads the processing from the host computer. Second, protection is much more important in the LAN than the SAN. The SAN is for data only, and is behind the server. From a SAN perspective, the server is like a firewall for the SAN, and hence the SAN is not required to provide protection. Third, storage designers think that graceful behavior under congestion is critical for SANs. The lack of flow control in Ethernet can lead to a lack of grace under pressure. TCP/IP copes with congestion by dropping packets, but storage applications do not appreciate dropped packets.

Not surprisingly, the LAN advocates have a response. First, Ethernet switches are less costly than SAN switches due to greater competition in the marketplace. Second, since Internet Protocol (IP) networks are naturally large, they enable replication of data to geographically diverse sites of the Internet. This geographical advantage both protects against disasters and offers an alternative to tape backup. Thus far, SANs have been relatively small, both in number of nodes and physical distance. Finally, although TCP/IP does have overhead, TCP/IP off-loading engines are appearing in the marketplace to try to preserve server utilization.

Some LAN advocates are embracing a standard called iSCSI, which exports native SCSI commands over IP networks. The operating system intercepts SCSI commands, and repackages and sends them in a TCP/IP message. At the receiving end, it unpacks messages into SCSI commands and issues them locally. iSCSI allows a company to send SCSI commands and data over its internal WAN or, if transmitted over the Internet, to locations with Internet access.

Wide Area Network: ATM

Asynchronous Transfer Mode (ATM) is the latest of the ongoing standards set by the telecommunications industry. Although it flirted with competing against Ethernet as a LAN in the 1990s, today ATM has retreated to its WAN stronghold.

The telecommunications standard has scalable bandwidth built in. It starts at 155 Mbits/second, and scales by factors of four to 620 Mbits/second, 2480 Mbits/second, and so on. Since it is a WAN, ATM's medium is fiber, both single mode and multimode. Although it is a switched medium, unlike the other examples, it relies on connections for communication. ATM uses virtual channels for routing to multiplex different connections on a single network segment, thereby avoiding the inefficiencies of conventional connection-based networking. The WAN focus also leads to store-and-forward routing. Unlike the other protocols, Figure 8.20 shows ATM has a small, fixed-size packet. (For those curious about the selection of a 48-byte payload, see Section 8.16.)
It uses a credit-based flow control scheme (see page 592).

The reason for connections and small packets is quality of service. Since the telecommunications industry is concerned about voice traffic, predictability matters as well as bandwidth. Establishing a connection has less variability than connectionless networking, and it simplifies store-and-forward routing. The small, fixed packet also makes it simpler to have fast routers and switches. Towards that goal, ATM even offers its own protocol stack to compete with TCP/IP. Surprisingly, even though the switches are simple, the ATM suite of protocols is large and complex. The dream was a seamless infrastructure from LAN to WAN, avoiding the hodge-podge of routers common today. That dream has faded from inspiration to nostalgia.
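To give a feel for what a small, fixed-size cell means in practice, the sketch below splits a message into 48-byte ATM payloads and estimates how long one cell occupies a 155 Mbits/second link. It is a simplified illustration: the 5-byte header is an assumption based on the standard ATM cell size (the text gives only the 48-byte payload), adaptation-layer overhead is ignored, and the function names are hypothetical.

```python
CELL_PAYLOAD = 48    # bytes of data per cell, as given in the text
CELL_HEADER = 5      # bytes; assumption based on the standard 53-byte ATM cell
LINK_MBITS = 155     # starting ATM link rate from the text


def cells_needed(message_bytes: int) -> int:
    """Number of fixed-size cells required to carry a message (ceiling division)."""
    return -(-message_bytes // CELL_PAYLOAD)


def wire_bytes(message_bytes: int) -> int:
    """Bytes actually sent, counting headers and the unused tail of the last cell."""
    return cells_needed(message_bytes) * (CELL_PAYLOAD + CELL_HEADER)


def cell_time_microseconds() -> float:
    """Time one cell occupies the link; Mbits/second equals bits per microsecond."""
    return (CELL_PAYLOAD + CELL_HEADER) * 8 / LINK_MBITS


if __name__ == "__main__":
    for size in (48, 49, 1500):
        sent = wire_bytes(size)
        print(f"{size:5d} bytes -> {cells_needed(size):3d} cells, "
              f"{sent} bytes on the wire, efficiency {size / sent:.0%}")
    print(f"one cell holds the link for about {cell_time_microseconds():.1f} microseconds")
```

The short per-cell transmission time is one way to see why small cells suit voice traffic: a waiting cell is never stuck behind a long packet. The price is header overhead and wasted space in the final cell of each message.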

FIGURE 8.22 Several examples of SAN, LAN, and WAN interconnection networks. FC-AL is a network for disks.

Network | 10-Mb Ethernet | 100-Mb Ethernet | 1000-Mb Ethernet | FC-AL | Infiniband | Myrinet | ATM
Class | LAN | LAN | LAN | SAN | SAN | SAN | WAN
Switch? | Optional | Optional | Yes | Optional | Yes | Yes | Yes
Media | Copper | Copper | Copper/fiber | Copper/fiber | Copper/fiber | Copper/m.m./s.m. fiber | Copper/fiber
Topology | Line or Star | Line or Star | Star | Ring or Star | Star | Star | Star
Connectionless? | Yes | Yes | Yes | Yes | Yes | Yes | No
Routing | Destination based | Destination based | Destination based | Destination based | Destination based | Destination based | Virtual circuit
Store & forward? | No | No | No | No | No | No | Yes
Congestion control | Carrier sense | Carrier sense | Carrier sense | Credit based | Backpressure | Backpressure | Credit based
Standard | IEEE | IEEE | IEEE 802.3ab | ANSI Task Group X3T11 | Infiniband Trade Association | ANSI/VITA | ATM Forum

[The rows for Length (meters), Number of data lines, Clock rate (MHz), Nodes, and Peak link BW (Mbits/sec) are mostly illegible in this copy. Legible entries include a length of 500/2500 meters for 10-Mb Ethernet and peak link bandwidths of 1300 to 2000 for Myrinet and 155/622/... for ATM.]

Summary

Figure 8.22 summarizes answers to the ten questions from the start of this section. It covers the three example networks discussed here, plus a few others. This section shows how similar technology gets different spins for the different concerns of LAN, SAN, and WAN. Nevertheless, the inherent similarity leads to marketplace competition. ATM tried (and failed) to usurp the LAN championship from Ethernet, and in 2001 Ethernet/iSCSI is trying to compete with Fibre Channel Arbitrated Loop (FC-AL) and Infiniband for the SAN markets.

8.8 Internetworking

Undoubtedly one of the most important innovations in the communications community has been internetworking. It allows computers on independent and incompatible networks to communicate reliably and efficiently. Figure 8.23 illustrates the need to cross networks. It shows the networks and machines involved in transferring a file from Stanford University to the University of California at Berkeley, a distance of about 75 km.

FIGURE 8.23 The connection established between mojave.stanford.edu and mammoth.berkeley.edu (1995). [Diagram: at Stanford, California, the machines fd-0.enss128.t3.ans.net (on the Internet T3 line), SU-CM.BARRNet.net, CIS-Gateway.Stanford.edu, and mojave.Stanford.edu are linked by FDDI and Ethernet. A T1 line leads to Berkeley, California, where UCB1.BARRNet.net, inr-108-eecs.Berkeley.edu, inr-111-cs2.Berkeley.edu, and mammoth.Berkeley.edu are linked by Ethernet and FDDI.] FDDI is a 100 Mbits/sec LAN, while a T1 line is a 1.5 Mbits/sec telecommunications line and a T3 is a 45 Mbits/sec telecommunications line. BARRNet stands for Bay Area Research Network. Note that inr-111-cs2.berkeley.edu is a router with two Internet addresses, one for each port.

The low cost of internetworking is remarkable. For example, it is vastly less expensive to send electronic mail than to make a coast-to-coast telephone call and leave a message on an answering machine. This dramatic cost improvement is achieved using the same long-haul communication lines as the telephone call, which makes the improvement even more impressive.

The enabling technologies for internetworking are software standards that allow reliable communication without demanding reliable networks. The underlying principle of these successful standards is that they were composed as a hierarchy of layers, each layer taking responsibility for a portion of the overall communication task. Each computer, network, and switch implements its layer of the standards, relying on the other components to faithfully fulfill their responsibilities. These layered software standards are called protocol families or protocol suites. They enable applications to work with any interconnection without extra work by the application programmer. Figure 8.24 suggests the hierarchical model of communication.

[Figure 8.24: three stacked levels labeled Applications, Internetworking, and Networks.]

FIGURE 8.24 The role of internetworking. The width indicates the relative number of items at each level.

The most popular internetworking standard is TCP/IP, which stands for transmission control protocol/internet protocol. This protocol family is the basis of the humbly named Internet, which connects tens of millions of computers around the world. This popularity means TCP/IP is used even when communicating locally across compatible networks; for example, the network file system NFS uses IP even though it is very likely to be communicating across a homogeneous LAN such as Ethernet. We use TCP/IP as our protocol family example; other protocol families follow similar lines. Section 8.16 gives the history of TCP/IP.

The goal of a family of protocols is to simplify the standard by dividing responsibilities hierarchically among layers, with each layer offering services needed by the layer above. The application program is at the top, and at the bottom is the physical communication medium, which sends the bits. Just as abstract data types simplify the programmer's task by shielding the programmer from details of the implementation of the data type, this layered strategy makes the standard easier to understand.

There were many efforts at network protocols, which led to confusion in terms. Hence, Open Systems Interconnect (OSI) developed a model that popularized describing networks as a series of layers. Figure 8.25 shows the model. Although not all protocols follow this layering exactly, the nomenclature for the different layers is widely used. Thus, you can hear discussions about a simple layer 3 switch versus a layer 7 smart switch.

Layer 7, Application. Main function: used for applications specifically written to run over the network. Example protocols: FTP, DNS, NFS, http. Network component: gateway, smart switch.
Layer 6, Presentation. Main function: translates from application to network format, and vice versa. Network component: gateway.
Layer 5, Session. Main function: establishes, maintains, and ends sessions across the network. Example protocols: named pipes, RPC. Network component: gateway.
Layer 4, Transport. Main function: additional connection below the session layer. Example protocol: TCP. Network component: gateway.
Layer 3, Network. Main function: translates logical network addresses and names to their physical address (e.g., computer name to MAC address). Example protocol: IP. Network component: router, ATM switch.
Layer 2, Data Link. Main function: turns packets into raw bits and, at the receiving end, turns bits into packets. Example protocol: Ethernet. Network component: bridge, network interface card.
Layer 1, Physical. Main function: transmits raw bit stream over physical cable. Example protocol: IEEE 802. Network component: hub.

FIGURE 8.25 The OSI model layers. Based on /

The key to protocol families is that communication occurs logically at the same level of the protocol in both sender and receiver, but services of the lower level implement it. This style of communication is called peer-to-peer. As an analogy, imagine that General A needs to send a message to General B on the battlefield. General A writes the message, puts it in an envelope addressed to General B, and gives it to a colonel with orders to deliver it. This colonel puts it in an envelope, writes the name of the corresponding colonel who reports to General B, and gives it to a major with instructions for delivery. The major does the same thing and gives it to a captain, who gives it to a lieutenant, who gives it to a sergeant. The sergeant takes the envelope from the lieutenant, puts it into an envelope with the name of a sergeant who is in General B's division, and finds a private with orders to take the large envelope. The private borrows a motorcycle and delivers the envelope to the other sergeant. Once it arrives, it is passed up the chain of command, with each person removing an outer envelope with his name on it and passing on the inner envelope to his superior. As far as General B can tell, the note is from another general. Neither general knows who was involved in transmitting the envelope, or how it was transported from one division to the other.

Protocol families follow this analogy more closely than you might think, as Figure 8.26 shows. The original message includes a header and possibly a trailer sent by the lower-level protocol. The next-lower protocol in turn adds its own header to the message, possibly breaking it up into smaller messages if it is too large for this layer. Reusing our analogy, a long message from the general is divided and placed in several envelopes if it could not fit in one. This division of the message and appending of headers and trailers continues until the message descends to the physical transmission medium. The message is then sent to the destination. Each level of the protocol family on the receiving end will check the message at its level and peel off its headers and trailers, passing it on to the next higher level and putting the pieces back together. This nesting of protocol layers for a specific message is called a protocol stack, reflecting the last-in-first-out nature of the addition and removal of headers and trailers.

As in our analogy, the danger in this layered approach is the considerable latency added to message delivery. Clearly, one way to reduce latency is to reduce the number of layers. Keep in mind, however, that protocol families define a standard but do not dictate how to implement it. Just as there are many ways to implement an instruction set architecture, there are many ways to implement a protocol family.

[Figure 8.26: sender and receiver stacks. At each level the message is wrapped in a header (H) and trailer (T); horizontal arrows between peers are labeled Logical and vertical arrows between layers are labeled Actual.]

FIGURE 8.26 A generic protocol stack with two layers. Note that communication is peer-to-peer, with headers and trailers for the peer added at each sending layer and removed by each receiving layer. Each layer offers services to the one above to shield it from unnecessary details.
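The nesting of headers and trailers described above can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the `Layer` class, the function names, and the bracketed marker strings are hypothetical and stand in for real protocol headers, which carry structured fields rather than fixed markers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Layer:
    """One protocol layer, modeled as a fixed header and trailer (hypothetical)."""
    name: str
    header: bytes
    trailer: bytes = b""


def send_down(message: bytes, stack: List[Layer]) -> bytes:
    """Wrap the message layer by layer; stack[0] is the highest layer.

    The lowest layer wraps last, so its header ends up outermost,
    like the sergeant's envelope in the analogy.
    """
    frame = message
    for layer in stack:
        frame = layer.header + frame + layer.trailer
    return frame


def pass_up(frame: bytes, stack: List[Layer]) -> bytes:
    """Peel headers and trailers in the reverse order they were added."""
    for layer in reversed(stack):
        if not (frame.startswith(layer.header) and frame.endswith(layer.trailer)):
            raise ValueError(f"frame is not valid at layer {layer.name}")
        frame = frame[len(layer.header):len(frame) - len(layer.trailer)]
    return frame


if __name__ == "__main__":
    two_layers = [Layer("upper", b"[U]", b"[/U]"), Layer("lower", b"[L]", b"[/L]")]
    wire = send_down(b"hello", two_layers)
    print(wire)                       # b'[L][U]hello[/U][/L]'
    print(pass_up(wire, two_layers))  # b'hello'
```

The last header added is the first one removed, which is the last-in-first-out behavior that gives the protocol stack its name.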

[Figure 8.27: header layout, 32 bits wide.
IP header: V, L, Type, Length; Identifier, Fragment; Time, Protocol, Header checksum; Source; Destination.
TCP header (carried as IP data): Source, Destination; Sequence number (length); Piggyback acknowledgment; L, Flags, Window; Checksum, Urgent pointer; then TCP data (0 to 65,516 bytes).]

FIGURE 8.27 The headers for IP and TCP. This drawing is 32 bits wide. The standard headers for both are 20 bytes, but both allow the headers to optionally lengthen for rarely transmitted information. Both headers have a length-of-header field (L) to accommodate the optional fields, as well as source and destination fields. The length of the whole datagram is in a separate length field in IP, while TCP combines the length of the datagram with the sequence number of the datagram by giving the sequence number in bytes. TCP uses the checksum field to be sure that the datagram is not corrupted, and the sequence number field to be sure the datagrams are assembled into the proper order when they arrive. IP provides checksum error detection only for the header, since TCP has protected the rest of the packet. One optimization is that TCP can send a sequence of datagrams before waiting for permission to send more. The number of datagrams that can be sent without waiting for approval is called the window, and the window field tells how many bytes may be sent beyond the byte being acknowledged by this datagram. TCP will adjust the size of the window depending on the success of the IP layer in sending datagrams; the more reliable and faster it is, the larger TCP makes the window. Since the window slides forward as the data arrives and is acknowledged, this technique is called a sliding window protocol. The piggyback acknowledgment field of TCP is another optimization. Since some applications send data back and forth over the same connection, it seems wasteful to send a datagram containing only an acknowledgment. This piggyback field allows a datagram carrying data to also carry the acknowledgment for a previous transmission, piggybacking on top of a data transmission.
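To make the header checksum concrete, here is a minimal Python sketch of the 16-bit ones' complement checksum that IP applies to its header, as described in RFC 1071. The function name and the sample 20-byte header are illustrative assumptions, not values taken from the figure.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit ones' complement of the ones' complement sum of data."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Hypothetical 20-byte IP header with the checksum field (bytes 10-11) set to
# zero: version/length 0x45, total length 0x0073, time-to-live 0x40,
# protocol 6 (TCP), source 192.168.0.1, destination 192.168.0.199.
SAMPLE_HEADER = bytes.fromhex("450000730000400040060000c0a80001c0a800c7")

if __name__ == "__main__":
    checksum = internet_checksum(SAMPLE_HEADER)
    print(hex(checksum))  # 0xb86c
    filled = SAMPLE_HEADER[:10] + checksum.to_bytes(2, "big") + SAMPLE_HEADER[12:]
    print(internet_checksum(filled))  # 0 means the header verifies
```

A receiver runs the same sum over the header with the checksum field filled in; a result of zero indicates that no error was detected.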
The urgent pointer field of TCP gives the address within the datagram of an important byte, such as a break character. This pointer allows the application software to skip over data so that the user doesn't have to wait for all prior data to be processed before seeing a character that tells the software to stop. The identifier field and fragment field of IP allow intermediary machines to break the original datagram into many smaller datagrams. A unique identifier is associated with the original datagram and placed in every fragment, with the fragment field saying which piece is which. The time-to-live field allows a datagram to be killed off after going through a maximum number of intermediate switches no matter where it is in the network. Knowing the maximum number of hops that it will take for a datagram to arrive, if it ever arrives, simplifies the protocol software. The protocol field identifies which possible upper-layer protocol sent the IP datagram; in our case, it is TCP. The V (for version) and type fields allow different versions of the IP protocol software for the network. Explicit version numbering is included so that software can be upgraded gracefully machine by machine, without shutting down the entire network.

Our protocol stack example is TCP/IP. Let's assume that the bottom protocol layer is Ethernet. The next level up is the Internet Protocol or IP layer; the official term for an IP packet is datagram. The IP layer routes the datagram to the destination machine, which may involve many intermediate machines or switches. IP makes a best effort to deliver the packets, but does not guarantee delivery, content, or order of datagrams. The TCP layer above IP makes the guarantee of reliable, in-order delivery and prevents corruption of datagrams. Following the example in Figure 8.26, assume an application program wants to send a message to a machine via an Ethernet. It starts with TCP. The largest number of bytes that can be sent at once is 64 KB. Since the data may be much larger than 64 KB, TCP must divide it into smaller segments and reassemble them in proper order upon arrival. TCP adds a 20-byte header (Figure 8.27) to every datagram, and passes them down to IP.
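The segmentation step can be sketched in a few lines of Python. This is a simplified illustration, not TCP itself: it assumes the 64 KB limit covers the 20-byte TCP header plus data (which yields the 65,516-byte maximum shown in Figure 8.27), it labels each segment with the byte offset of its first byte in the spirit of TCP's byte-based sequence numbers, and it ignores acknowledgments, windows, and loss. The function and constant names are hypothetical.

```python
import random
from typing import List, Tuple

TCP_HEADER_BYTES = 20
MAX_DATAGRAM_BYTES = 64 * 1024
MAX_PAYLOAD_BYTES = MAX_DATAGRAM_BYTES - TCP_HEADER_BYTES  # 65,516 bytes

Segment = Tuple[int, bytes]  # (sequence number in bytes, payload)


def segment(message: bytes, max_payload: int = MAX_PAYLOAD_BYTES) -> List[Segment]:
    """Split a message into segments, each tagged with its starting byte offset."""
    return [
        (offset, message[offset:offset + max_payload])
        for offset in range(0, len(message), max_payload)
    ]


def reassemble(segments: List[Segment]) -> bytes:
    """Restore the original byte order no matter how the segments arrived."""
    return b"".join(payload for _, payload in sorted(segments))


if __name__ == "__main__":
    message = bytes(range(256)) * 1000  # 256,000 bytes of sample data
    pieces = segment(message)
    random.shuffle(pieces)              # model out-of-order delivery by IP
    print(len(pieces))                  # 4
    print(reassemble(pieces) == message)  # True
```

Because each sequence number is a byte offset, it identifies both where a segment belongs and, together with the payload length, where the next one starts.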
The IP layer above the physical layer adds a 20-byte header, also shown in Figure 8.27. The data sent down from the IP level to the Ethernet is sent in packets with the format shown in Figure 8.20 on page 605. Note that the TCP packet appears inside the data portion of the IP datagram, just as Figure 8.26 suggests.

8.9 Crosscutting Issues for Interconnection Networks

This section describes four topics discussed in other chapters that are fundamental to interconnections.

Density-optimized Processors versus SPEC-optimized Processors

Given that people all over the world are accessing WWW sites, it doesn't really matter where your servers are located. Hence, many servers are kept at collocation sites, which charge by network bandwidth reserved and used, and by space occupied and power consumed. Desktop microprocessors in the past have been designed to be as fast as possible at whatever heat could be dissipated, with little regard to the size of the package and surrounding chips. One microprocessor in 2001 burns 135 watts! Floor space efficiency was also largely ignored. As a result of these priorities, power is a major cost for collocation sites, and density of processors is limited by the power consumed and dissipated.

With portable computers making different demands on power consumption and cooling for processors and disks, the opportunity exists for using this technology to create considerably denser computation. In such a case performance per watt or performance per cubic foot could replace performance per microprocessor as the important figure of merit. The key is that many applications already work with large clusters (see section 8.10), so it's possible that replacing 64 power-hungry processors with, say, 256 efficient processors could be cheaper to run yet be software compatible.

Smart Switches vs. Smart Interface Cards

Figure 8.28 shows that one trade-off is where intelligence is located in the network. Generally the question is whether to have smarter network interfaces or smarter switches. Making one side smarter generally makes the other side easier and less expensive. By having an inexpensive interface it was possible for Ethernet to become standard as part of most desktop and server computers. Lower-cost switches were made available for people with small configurations, who do not need the sophisticated routing tables and spanning tree protocols of larger Ethernet switches. Infiniband is trying a hybrid approach by offering lower-cost interface cards for less demanding devices, such as disks, in the hopes that it will be included with some I/O devices. As Infiniband is planned as the successor to the PCI bus, computers may come with a Host Channel Adapter built in.

Protection and User Access to the Network

A challenge is to ensure safe communication across a network without invoking the operating system in the common case. The Cray Research T3D supercomputer offers an interesting case study. It supports a global address space, so loads and stores can access memory across the network. Protection is ensured because each access is checked by the TLB. To support transfer of larger objects, a block transfer engine (BLT) was added to the hardware. Protection of access requires invoking the operating system before using the BLT, to check the range of accesses to be sure there will be no protection violations. Figure 8.29 compares the bandwidth delivered as the size of the object varies for reads and writes. For very large reads, 512 KB, the BLT does achieve the highest performance: 140 MBytes/sec. But simple loads get higher performance for 8 KB or less. For the write case, both achieve a peak of 90 MBytes/sec, presumably because of the limitations of the memory bus. But for writes, the BLT can only match the performance of simple stores for transfers of 2 MB; for anything smaller, it's faster to send stores. Clearly, a BLT that avoided invoking the operating system in the common case would be more useful.

[Figure 8.28: switches and interface cards placed along an axis labeled "More intelligence." Switch: Small Scale Ethernet Switch, Myrinet, Infiniband, Large Scale Ethernet Switch. Interface card: Ethernet, Infiniband Target Channel Adapter, Myrinet, Infiniband Host Channel Adapter.]

FIGURE 8.28 Intelligence in a network: Switch vs. Interface card. Note that Ethernet switches come in two styles, depending on the size of the network, and that Infiniband network interfaces come in two styles, depending on whether they are attached to a computer or to a storage device. Myrinet is a proprietary System Area Network.

Efficient Interface to Memory Hierarchy versus Interconnection Network

Traditional evaluations of processor performance, such as SPECint and SPECfp, encourage integration of the memory hierarchy with the processor, as the efficiency of the memory hierarchy translates directly into processor performance. Hence, microprocessors have first-level caches on chip along with buffers for writes, and usually have second-level caches on-chip or immediately next to the chip. Benchmarks such as SPECint and SPECfp do not reward good interfaces to interconnection networks, and hence many machines make the access time to the network delayed by the full memory hierarchy. Writes must lumber their way through full write buffers, and reads must go through the cycles of first- and second-level cache misses before reaching the interconnection. This hierarchy results in newer systems having higher latencies to interconnections than older machines.

Let's compare three machines from the past: a 40-MHz SPARCstation-2, a 50-MHz SPARCstation-20 without an external cache, and a 50-MHz SPARCstation-20 with an external cache. According to SPECint95, this list is in order of increasing performance. The time to access the I/O bus (S-bus), however, increases in this sequence: 200 ns, 500 ns, and 1000 ns. The SPARCstation-2 is fastest because it has a single bus for memory and I/O, and there is only one level to the cache. The SPARCstation-20 memory access must first go over the memory bus (M-bus) and then to the I/O bus, adding 300 ns. Machines with a second-level cache pay an extra penalty of 500 ns before accessing the I/O bus. On the other hand, recent computers have dramatically improved memory bandwidth, which is helpful to network bandwidth.

[Figure 8.29: plot of bandwidth (MB/sec) against transfer size (bytes), with curves for BLT read, BLT write, CPU read, and CPU write.]

FIGURE 8.29 Bandwidth versus transfer size for simple memory access instructions versus a block transfer device on the Cray Research T3D. (Arpaci et al. [1995].)

Compute-Optimized Processors versus Receiver Overhead

The overhead to receive a message likely involves an interrupt, which bears the cost of flushing and then restarting the processor pipeline. As mentioned earlier, reading the network status and receiving the data from the network interface likely operate at cache-miss speeds. As microprocessors become more superscalar and go to faster clock rates, the number of missed instruction issue opportunities per message reception is likely to rise quickly over time.

8.10 Clusters

...do-it-yourself Beowulf clusters built from commodity hardware and software...has mobilized a community around a standard architecture and tools. Beowulf's economics and sociology are poised to kill off the other... architectural lines and will likely affect traditional supercomputer centers as well.
Gordon Bell and Jim Gray [2001]

Instead of relying on custom machines and custom networks to build massively parallel machines, the introduction of switches as part of LAN technology meant that high network bandwidth and scaling were available from off-the-shelf components. When combined with using desktop computers and disks as the computing and storage devices, a much less expensive computing infrastructure could be created that could tackle very large problems. And by their component nature, clusters are much easier to scale and more easily isolate failures.

There are many mainframe applications, such as databases, file servers, Web servers, simulations, and multiprogramming/batch processing, amenable to running on more loosely coupled machines than the cache-coherent NUMA machines of Chapter 6. These applications often need to be highly available, requiring some form of fault tolerance and repairability. Such applications, plus the similarity of the multiprocessor nodes to desktop computers and the emergence of high-bandwidth, switch-based local area networks, lead to clusters of off-the-shelf, whole computers for large-scale processing.

Performance Challenges of Clusters

One drawback is that clusters are usually connected using the I/O bus of the computer, whereas multiprocessors are usually connected on the memory bus of the computer. The memory bus has higher bandwidth and much lower latency, allowing multiprocessors to drive the network link at higher speed and to have fewer conflicts with I/O traffic on I/O-intensive applications. This connection point also means that clusters generally use software-based communication while multiprocessors use hardware for communication. However, connecting on the memory bus makes connections nonstandard and hence more expensive.

A second weakness is the division of memory: a cluster of N machines has N independent memories and N copies of the operating system, but a shared address multiprocessor allows a single program to use almost all the memory in the computer. Thus, a sequential program in a cluster has 1/Nth the memory available compared to a sequential program in a shared memory multiprocessor. Interestingly, the drop in DRAM prices has made memory costs so low that this multiprocessor advantage is much less important in 2001 than it once was. The primary issue in 2001 is whether the maximum memory per cluster node is sufficient for the application.

Dependability and Scalability Advantage of Clusters

The weakness of separate memories for program size turns out to be a strength in system availability and expansibility. Since a cluster consists of independent computers connected through a local area network, it is much easier to replace a machine without bringing down the system in a cluster than in a shared memory multiprocessor. Fundamentally, the shared address means that it is difficult to isolate a processor and replace a processor without significant work by the operating system and hardware designer. Since the cluster software is a layer that runs on top of local operating systems running on each computer, it is much easier to disconnect and replace a broken machine.

Given that clusters are constructed from whole computers and independent, scalable networks, this isolation also makes it easier to expand the system without bringing down the application that runs on top of the cluster. High availability and rapid, incremental extensibility make clusters attractive to service providers for the World Wide Web.

Pros and Cons of Cost of Clusters

One drawback of clusters has been the cost of ownership. Administering a cluster of N machines is close to the cost of administering N independent machines, while the cost of administering a shared address space multiprocessor with N processors is close to the cost of administering a single, big machine.

Another difference between the two tends to be the price for equivalent computing power for large-scale machines. Since large-scale multiprocessors have small volumes, the extra development costs of large machines must be amortized over few systems, resulting in higher cost to the customer. As we shall see, even prices for components common to small machines are increased, possibly to recover development costs. In addition, the manufacturer learning curve (see 573 in the prior chapter) brings down the price of components used in the high-volume PC market. Since the same switches sold in high volume for small systems can be composed to construct large networks for large clusters, local area network switches have the same economy-of-scale advantages as small computers.

Originally, the partitioning of memory into separate modules in each node was a significant disadvantage to clusters, as division means memory is used less efficiently than on a shared address computer. The incredible drop in price of memory has mitigated this weakness, dramatically changing the trade-offs in favor of clusters.

Shooting for the Best of Both Worlds

As is often the case with two competing solutions, each side tries to borrow ideas from the other to become more attractive. On one side of the battle, to combat the high-availability weakness of multiprocessors, hardware designers and operating system developers are trying to offer the ability to run multiple operating systems on portions of the full machine. The goal is that a node can fail or be upgraded without bringing down the whole machine. For example, the Sun Fire 6800 server has these features (see section 5.15).

On the other side of the battle, since both system administration and memory size limits are approximately linear in the number of independent machines, some are reducing the cluster problems by constructing clusters from small-scale shared memory multiprocessors. A more radical approach is to keep storage outside of the cluster, possibly over a SAN, so that all computers inside can be treated as clones of one another. As the nodes may cost on the order of a few thousand dollars, it can be cheaper to simply discard a flaky node than to spend the labor costs to try hard to repair it. The tasks of the failed node are then handed off to another clone. Clusters are also benefiting from faster SANs and from network interface cards that offer lower-overhead communication.

Popularity of Clusters

Low cost, scaling, and fault isolation proved a perfect match to the companies providing services over the Internet since the mid 1990s. Internet applications such as search engines and servers are amenable to more loosely coupled computers, as the parallelism consists of millions of independent tasks. Hence, companies like Amazon, AOL, Google, Hotmail, Inktomi, WebTV, and Yahoo rely on clusters of PCs or workstations to provide services used by millions of people every day. We delve into Google in section 8.12.

Clusters are growing in popularity in the scientific computing market as well. Figure 8.30 shows the mix of architecture styles between 1993 and 2000 for the top 500 fastest scientific computers. One attraction is that individual scientists can afford to construct clusters themselves, allowing them to dedicate their cluster to their problem. Shared supercomputers are placed on monthly allocation of CPU time, so it's plausible for a scientist to get more work done from a private cluster than from a shared supercomputer. It is also relatively easy for the scientist to scale his computing over time as he gets more money for computing.

[Figure 8.30: share of Top 500 sites by architecture style from Jun-93 to Jun-00, sampled each June and December. Styles: Single Instruction Multiple Data (SIMD), Cluster (Network of Workstations), Cluster (Network of SMPs), Massively Parallel Processors (MPPs), Shared Memory Multiprocessors (SMPs), and Uniprocessors.]

FIGURE 8.30 Plot of Top 500 supercomputer sites between 1993 and 2000. Note that clusters of various kinds grew from 2% to almost 30% in the last three years, while uniprocessors and SMPs have almost disappeared. In fact, most of the MPPs in the list look similar to clusters. In 2001, the top 500 collectively has a performance of about 100 Teraflops [Bell 2001]. Performance is measured as speed of running Linpack, which solves a dense system of linear equations. This list is updated twice a year.

Clusters are also growing in popularity in the database community. Figure 8.31 plots the cost-performance and the cost-performance per processor of the different architecture styles running the TPC-C benchmark. Note in the top graph that not only are clusters fastest, they achieve good cost-performance. For example, five SMPs with just 6 to 8 processors have worse cost-performance than the 280-processor cluster! Only small SMPs with two to four processors have much better cost-performance than clusters. This combination of high performance and cost-effectiveness is rare. Figure 8.32 shows similar results for TPC-H.

The bottom half of Figure 8.31 shows the scalability of clusters for TPC-C. They scale by about a factor of eight in price or processors while maintaining respectable cost-performance.

Now that we have covered the pros and cons of clusters and showed their successes in several fields, the next step is to design some clusters.

[Figure 8.31: two plots for Cluster, NUMA, and SMP systems. Top: thousands of transactions per minute versus total system cost ($ millions, from $0 to $20). Bottom: transactions per minute per $1000 versus number of processors.]

FIGURE 8.31 Performance, Cost, and Cost-Performance per Processor for TPC-C. Not only do clusters have the highest tpmC rating, they have better cost-performance ($/tpmC) than any SMP with a total cost over $1M. The bottom graph shows that clusters get high performance by scaling. They can sustain 40 to 50 transactions per minute per $1000 of cost from 32 to 280 processors. Figure 8.40 on page 636 describes the leftmost cluster, and Figure 8.41 on page 637 shows the cost model of TPC-C in more detail. These plots are for all computers that have run version 5 of the TPC-C benchmark as of August 2001.

[Figure 8.32: plot of queries per hour (QphH) versus price (thousands, from $0 to $20,000) for Cluster, NUMA, and SMP systems, with ovals grouping the 100 GB, 300 GB, 1000 GB, and 3000 GB scale factors.]

FIGURE 8.32 Performance vs. Cost for TPC-H in August 2001. Clusters are used for the largest computers, NUMA the smaller computers, and SMP the smallest. In violation of TPC-H rules, this figure plots results for different TPC-H scale factors (SF): 100 GB, 300 GB, 1000 GB, and 3000 GB. The ovals separate them.

8.11 Designing a Cluster

To take the discussion of clusters from the abstract to the concrete, this section goes through four examples of cluster design. Like section 7.11 in the prior chapter, the examples evolve in realism. The examples of the last chapter, which examined performance and availability, apply to clusters as well. Instead, we show cost trade-offs, a topic rarely found in computer architecture. In each case we are designing a system with about 32 processors, 32 GB of DRAM, and 32 or 64 disks. Figure 8.33 lists the components we use to construct the cluster, including their prices.

Before starting the examples, Figure 8.33 confirms some of the philosophical points of the prior section. Note the difference in processor cost and speed between the smaller systems and the larger multiprocessor. In addition, the price per DRAM DIMM goes up with the size of the computers.

Regarding the processors, the server chip includes a much larger L2 cache, increasing from 0.25 MB to 1 MB. Due to its much larger die size, the price of the 1-MB-cache chip is more than double that of the 0.25-MB-cache chip. The purpose of the larger L2 cache is to reduce memory bandwidth to allow eight processors to share a memory system. Not only are these large-cache chips much more expensive, it has also been hard for Intel to achieve clock rates similar to those of the small-cache chips: the large-cache chip runs at 700 MHz in August 2001.

The higher price of the DRAM is harder to explain based on cost. For example, all include ECC. The uniprocessor uses 133 MHz SDRAM, and the 2-way and 8-way both use registered DIMM (RDIMM) SDRAM. There might be a slightly higher cost for the buffered DRAM between the uniprocessor and 2-way boxes, but it is hard to explain increasing the price 1.5 times for the 8-way SMP vs. the 2-way SMP. In fact, the 8-way SDRAM operates at just 100 MHz. Presumably, customers willing to pay a premium for processors for an 8-way SMP are also willing to pay more for memory.

The reasons for the higher prices matter little to the designer of a cluster. The task is to minimize cost for a given performance target. To motivate this section, here is an overview of the four examples:

1. Cost of Cluster Hardware Alternatives with Local Disk: The first example compares the cost of building from a uniprocessor, a 2-way SMP, and an 8-way SMP. In this example, the disks are directly attached to the computers in the cluster.
2. Cost of Cluster Hardware Alternatives with Disks over SAN: The second example moves the disk storage behind a RAID controller on a Storage Area Network.
3. Cost of Cluster Options that is more realistic: The third example includes the cost of software, the cost of space, some maintenance costs, and operator costs.
4. Cost and Performance of a Cluster for Transaction Processing: This final example describes a similar cluster tailored by IBM to run the TPC-C benchmark. (It is one of the cluster results in Figure 8.31.) This example has more memory and many more disks to achieve a high TPC-C result, and at the time of this writing, it is the 13th fastest TPC-C system. In fact, the machine with the fastest TPC-C result is just a replicated version of this cluster with a bigger LAN switch. This section highlights the differences between this database-oriented cluster and the prior examples.

[Figure 8.33 table. Where three values are given they are for the xSeries 300 | xSeries 330 | xSeries 370; rows whose values did not survive transcription are marked as not legible.
Maximum number of processors per box: 1 | 2 | 8
Pentium III processor clock rate (MHz): not legible
L2 cache (KB): 256 | 256 | 1,024
Price of base computer with 1 processor: $1,759 | $1,939 | $14,614
Price per extra processor: n.a. | $799 | $1,779
Price per 256 MB SDRAM DIMM: $159 | $269 | $369
Price per 512 MB SDRAM DIMM: $549 | $749 | $1,069
Price per 1024 MB SDRAM DIMM: n.a. | $1,689 | $2,369
IBM 36.4 GB 10K RPM Ultra160 SCSI disk: $579 | $639 | $639
IBM 73.4 GB 10K RPM Ultra160 SCSI disk: n.a. | $1,299 | $1,299
PCI slots (32-bit, 33 MHz / 64-bit, 33 MHz / 64-bit, 66 MHz): 1 / 0 / 0 | 0 / 2 / 0 | 0 / 8 / 4
Rack space (VME rack units): not legible
Power supply: 200 W | 200 W | 3 x 750 W
Emulex cLAN-1000 Host Adapter (1 Gbit): $795
Emulex cLAN5000 8-port switch: $6,280
Emulex cLAN5000 rack space (R.U.): not legible
Emulex cLAN5300 30-port switch: $15,995
Emulex cLAN5300 rack space (R.U.): not legible
Emulex cLAN cable: $135
Extra PCI Ultra160 SCSI adapter: $299
EXP300 Storage Enclosure (up to 14 disks): $3,179
EXP300 rack space (VME rack units): not legible
Ultra2 SCSI 4-meter cable: $105
Standard 19-in rack (44 VME rack units): $1,795]

FIGURE 8.33 Prices of options for three rack-mounted servers from IBM and 1-Gbit Ethernet switches from Emulex in August 2001. Note the higher price for processors and DRAM DIMMs with larger computers. The base price of these computers includes 256 MB of DRAM (512 MB for the 8-way server), two slots for disks, an UltraSCSI 160 adapter, two 100 Mbit Ethernets, a CD-ROM drive, a floppy drive, six to eight fans, and SVGA graphics. The power supply for the Emulex switches is 200 watts and is 500 watts for the EXP300. In the xSeries 370 you must add an accelerator costing $1249 to go over 4 CPUs.
First Example: Cost of Cluster Hardware Alternatives with Local Disk

This first example looks only at the hardware cost of the three alternatives using the IBM pricing information. We'll look at the cost of software and space later.

EXAMPLE Using the information in Figure 8.33, compare the cost of the hardware for three clusters built from the three options in the figure. In addition, calculate the rack space. The goal for this example is to construct a cluster with 32 processors, 32 GB of memory protected by ECC, and more than 2 TB of disk. Connect the clusters with 1-gigabit switched Ethernet.

ANSWER Figure 8.34 shows the logical organization of the three clusters. Let's start with the 1-processor option (IBM xSeries model 300). First, we need 32 processors and thus 32 computers. The maximum memory for this computer is 1.5 GB, allowing 1 GB x 32 = 32 GB. Each computer can hold two disks and the largest disk available in the model 300 is 36.4 GB, yielding 32 x 2 x 36.4 GB or 2330 GB. Using the built-in slots for storage is the least expensive solution, so we'll take this option. Each computer needs its own Gbit host adapter, but 32 cables are more than a 30-port switch can handle. Thus, we use two Emulex cLAN5300 switches. We connect the two switches together with four cables, leaving plenty of ports for the 32 computers.

A standard VME rack is 19 inches wide and about 6 feet tall, with a typical depth of 30 inches. This size is so popular that it has its own units: 1 VME rack unit (RU) is about 1.75 inches high, so a rack can hold objects up to 44 RU. The 32 uniprocessor computers each use 1 rack unit of space, plus 2 rack units for each switch, for a total of 36 rack units. This fits snugly in one standard rack.

For the 2-processor case (model 330), everything is halved. The 32 processors need only 16 computers. The maximum memory is 4 GB, but we need just 2 GB per computer to hit our target of 32 GB total. This model allows 73.4 GB disks, so we need only 16 x 2 x 73.4 GB to reach 2.3 TB, and these disks fit in the slots in the computers. A single 30-port switch has more ports than we need. The total space demand is 18 rack units (16 x 1 + 1 x 2), or less than half a standard rack.

The 8-processor case (model 370) needs only 4 computers to hold 32 processors. The maximum memory is 32 GB, but we need just 8 GB per computer to reach our target. Since there are only 4 computers, the 8-port switch is fine. The shortcoming is in disks. At 2 disks per computer, these 4 computers can hold at most 8 disks, and the maximum capacity per disk is still 73.4 GB.
The solution is to add a storage expansion box (EXP300) to each computer, which can hold up to 14 disks. This solution requires adding an external UltraSCSI controller to each computer as well. The rack space is 8 RU for the computer, 3 RU for the disk enclosure, and 1 RU for the switch. Alas, the total is 4 x (8 + 3) + 1 or 45 rack units, which just misses the maximum of a standard rack. Hence, this option occupies two racks. Figure 8.35 shows the total cost of each option. This example shows some issues for clusters:

* Expandability incurs high prices. For example, just 4 of the base 8-way SMPs--each with just one processor and 0.5 GB of DRAM--cost more than 32 uniprocessor computers, each with 1 processor and 0.25 GB of DRAM. The only hope of cost competitiveness is to occupy all the options of a large SMP.

[Figure: diagrams of the uniprocessor cluster, the 2-way SMP cluster, and the 8-way SMP cluster, each built around 1-Gigabit Ethernet switches and connected to the Internet.]

FIGURE 8.34 Three cluster organizations based on uniprocessors (top), 2-way SMPs (middle), and 8-way SMPs (bottom). P stands for processor, M for memory (1, 2, and 8 GB), and D for disk (36.4, 73.4, 73.4 GB).
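The sizing arithmetic of the first example can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the answer (32 processors in total, a 44 RU rack, and the rack-unit heights quoted for the computers, switches, and EXP300 enclosure); the function and variable names are hypothetical and chosen for this sketch.

```python
import math

RACK_CAPACITY_RU = 44  # a standard rack holds up to 44 rack units


def size_cluster(cpus_per_box, box_ru, switches, switch_ru,
                 enclosure_ru=0, total_cpus=32):
    """Return (computers, rack units, racks) for one cluster option."""
    computers = math.ceil(total_cpus / cpus_per_box)
    # Each computer may carry one external disk enclosure (0 RU if none).
    rack_units = computers * (box_ru + enclosure_ru) + switches * switch_ru
    racks = math.ceil(rack_units / RACK_CAPACITY_RU)
    return computers, rack_units, racks


def local_disk_gb(computers, disk_gb, slots_per_box=2):
    """Capacity available from the built-in disk slots alone."""
    return computers * slots_per_box * disk_gb


# Uniprocessor: 32 boxes of 1 RU plus two 30-port switches of 2 RU each.
one_way = size_cluster(cpus_per_box=1, box_ru=1, switches=2, switch_ru=2)
# 2-way SMP: 16 boxes of 1 RU plus one 30-port switch of 2 RU.
two_way = size_cluster(cpus_per_box=2, box_ru=1, switches=1, switch_ru=2)
# 8-way SMP: 4 boxes of 8 RU, each with a 3 RU enclosure, plus a 1 RU switch.
eight_way = size_cluster(cpus_per_box=8, box_ru=8, switches=1, switch_ru=1,
                         enclosure_ru=3)

for name, result in [("1-way", one_way), ("2-way", two_way), ("8-way", eight_way)]:
    print(name, "computers=%d rack_units=%d racks=%d" % result)
```

The 8-way case shows why the enclosure matters: its built-in slots alone give `local_disk_gb(4, 73.4)`, far short of the 2 TB goal, and adding the enclosures pushes the total to 45 RU and therefore a second rack.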

* Network vs. local bus trade-off. Figure 8.35 shows how the larger SMPs need to spend less on networking, as the memory buses carry more of the communication workload.

The uniprocessor cluster costs 1.1 times the 2-way SMP option, and the 8-way SMP cluster costs 1.6 times the 2-way SMP. The 2-way SMP wins the cost competition because the components are relatively cost-effective and it needs fewer systems and network components.

Price (thousands) | 1-way | 2-way | 8-way
System | $56 | $31 | $58
Extra processors | $0 | $13 | $55
Extra memory | $23 | $43 | $71
Disk | $37 | $42 | $56
Network | $62 | $31 | $10
Total | $180 | $161 | $253

FIGURE 8.35 Price of three clusters with a total of 32 processors, 32 GB memory, and 2.3 TB disk. Note the reduction in network costs as the size of the SMP increases, since the memory buses supply more of the interprocessor communication. Rack prices are included in the total price, but are too small to show in the bars. They account for $1725 in the first two cases and $3450 in the third case.

Second Example: Using a SAN for Disks

The previous example uses disks local to the computer. Although this can reduce costs and space, the problem for the operator is that 1) there is no protection against a single disk failure, and 2) there is state in each computer that must be managed separately. Hence, the system is down on a disk failure until the operator arrives, and there is no separate visibility or access to storage. This second example centralizes the disks behind a RAID controller in each case using FC-AL as the Storage Area Network. To keep comparisons fair, we continue to use IBM components. Figure 8.36 lists the costs of the components in this option. Note that this IBM RAID controller requires FC-AL disks.

IBM FC-AL High Availability RAID storage server | $15,999
IBM 73.4 GB 10K RPM FC-AL disk | $1,699
IBM EXP500 FC-AL storage enclosure (up to 10 disks) | $3,815
FC-AL 10-meter cables | $100
IBM PCI FC-AL host bus adapter | $1,485
IBM FC-AL RAID server rack space (VME rack units) | 3
IBM EXP500 FC-AL rack space (VME rack units) | 3

FIGURE 8.36 Components for Storage Area Network cluster.

EXAMPLE Using the information in Figure 8.36, calculate the cost of the hardware for the three clusters above, but now use the SAN and RAID controller.

ANSWER The change from the clusters in the first example is that we remove all internal SCSI disks and replace them with FC-AL disks behind the RAID storage server. To connect to the RAID box, we add an FC-AL host bus adapter per computer to the uniprocessor and 2-way SMP clusters and replace the SCSI host bus adapter in the 8-way SMP cluster. FC-AL can be connected in a loop with up to 127 devices, so there is no problem in connecting the computers to the RAID box. The RAID box has a separate FC-AL loop for the disks. It has room for 10 FC-AL disks, so we need three EXP500 enclosures for the remaining 22 FC-AL disks. (The FC-AL disks are half-height, which is taller than the low-profile SCSI disks, so we can fit only 10 FC-AL disks per enclosure.) We just need to add cables for each segment of the loop. Since the RAID box needs 3 rack units, as does each of the 3 enclosures, we need 12 additional rack units of space.
This adds a second rack to the uniprocessor cluster, but there is sufficient space in the racks of the other clusters. If we use RAID-5 and have a parity group size of 8 disks, we still have 28 disks of data or 28 x 73.4 or 2.05 TB of user data, which is sufficient for our goals. Figure 8.37 shows the hardware costs of this solution. Since there must be one FC-AL host bus adapter per computer, they cost enough to bring the prices of the uniprocessor and 8-way SMP clusters to parity. The 2-way SMP is still substantially cheaper. Notice that again the cost of both the LAN network and the SAN network decreases as the number of computers in the cluster decreases.

Price (thousands) | 1-way | 2-way | 8-way
System | $56 | $31 | $58
Extra processors | $0 | $13 | $55
Memory | $23 | $43 | $71
Disk | $54 | $54 | $54
LAN network | $62 | $31 | $10
RAID + enclosure | $27 | $27 | $27
SAN network | $54 | $29 | $10
Total | $281 | $230 | $289

FIGURE 8.37 Prices for hardware for three clusters using SAN for storage. As in Figure 8.35, the cost of the SAN network also shrinks as the servers increase in number of processors per computer. They share the FC-AL host bus adapters and also have fewer cables. Rack prices are too small to see in the columns, but they account for $3450, $1725, and $3450, respectively.

The SAN adds about $40,000 to $100,000 to the price of the hardware for the clusters. We'll see in the next example whether we can justify such costs.
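As a rough check on the second example, the short Python sketch below recomputes the RAID-5 user capacity and the extra rack space from the figures quoted in the answer. It assumes one disk's worth of parity per group of 8 disks, which is how the text arrives at 28 data disks; all names are hypothetical.

```python
import math

DISK_GB = 73.4          # FC-AL disk size from Figure 8.36
TOTAL_DISKS = 32
PARITY_GROUP = 8        # assumed: one parity disk per group of 8
RAID_BOX_SLOTS = 10     # disks held by the RAID storage server itself
ENCLOSURE_SLOTS = 10    # disks per EXP500 enclosure
RU_PER_UNIT = 3         # RAID server and each enclosure take 3 RU
RACK_CAPACITY_RU = 44

parity_disks = TOTAL_DISKS // PARITY_GROUP
data_disks = TOTAL_DISKS - parity_disks
user_tb = data_disks * DISK_GB / 1000

enclosures = math.ceil((TOTAL_DISKS - RAID_BOX_SLOTS) / ENCLOSURE_SLOTS)
extra_ru = (1 + enclosures) * RU_PER_UNIT


def racks_needed(base_ru):
    """Racks required once the SAN storage is added to a cluster."""
    return math.ceil((base_ru + extra_ru) / RACK_CAPACITY_RU)


# Rack units of the three clusters from the first example: 36, 18, and 45.
san_racks = {name: racks_needed(ru)
             for name, ru in [("1-way", 36), ("2-way", 18), ("8-way", 45)]}

print(data_disks, "data disks,", user_tb, "TB of user data")
print(enclosures, "enclosures,", extra_ru, "extra rack units")
print(san_racks)
```

The rack counts agree with the rack prices in the caption of Figure 8.37, where only the 2-way cluster still pays for a single rack.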

Third Example: Accounting for Other Costs

The first and second examples calculated only the cost of the hardware (which is what you might expect from a book on computer architecture). There are two other obvious costs not included: software and the cost of a maintenance agreement for the hardware. Figure 8.38 lists the costs covered in this example.

Software: Windows (1 to 4 CPUs) + IBM Director | $799
Software: Windows (5 to 8 CPUs) + IBM Director | $3,295
Software: SQL Server database (per processor!) | $16,541
3-year HW maintenance: LAN switches + HBA | $45,000
3-year HW maintenance: IBM xSeries computers | 7.5%
Rack space rental (monthly per rack) | $800 to $1200
Extra 20-amp circuit per rack (monthly) | $200 to $400
Bandwidth charges per megabit (monthly) | $500 to $2000
Operator costs (yearly) | $100,000
DLT tapes (40 GB raw, 80 GB compressed) | $70

FIGURE 8.38 Components for Storage Area Network cluster in 2001. Notice the higher cost of the operating system in the larger server. (Red Hat Linux 7.1, however, is $49 for all three.)

Notice that Microsoft quadruples the price when the operating system runs on a computer with 5 to 8 processors versus a computer with 1 to 4 processors. Moreover, the database cost is primarily a linear function of the number of processors. Once again, software pricing appears to be based on value to the customer versus cost of development.

Another significant cost is the cost of the operators to keep the machine running, upgrade software, perform backup and restore, and so on. In 2001, the cost (including overhead) is about $100,000 per year for an operator. In addition to labor costs, backup uses up tapes to act as the long-term storage for the system. A typical backup policy is daily incremental dumps and weekly full dumps. A common practice is to save four weekly tapes and then one full dump per month for the last six months. The total is 10 full dumps, plus a week of incremental dumps.

There are other costs, however. One is the cost of the space to house the server.
Thus, collocation sites have been created to provide virtual machine rooms for companies. They provide scalable space, power, cooling, and network bandwidth, plus physical security. They make money by charging rent for space, for network bandwidth, and for optional services from on-site administrators.

Collocation rates are negotiated and much cheaper per unit as space requirements increase. A rough guideline in 2001 is that rack space, which includes one 20-amp circuit, costs $800 to $1200 per month. It drops by 20% if you use more than 75 to 100 racks. Each additional 20-amp circuit per rack costs another $200 to $400 per month. Although we are not calculating these costs in this case, they also charge for network bandwidth: $1500 to $2000 per Mbits/sec per month if your continuous use is just 1-10 Mbits/second, dropping to $500 to $750 per Mbits/sec per month if your continuous use measures 1-2 Gbits/second.

Pacific Gas and Electric in Silicon Valley limits a single building to no more than 12 megawatts of power, and the typical size of a building is no more than 100,000 square feet. Thus, a guideline is that collocation sites are designed assuming no more than 100 watts per square foot. If you include the space for people to get access to a rack to repair and replace components, a rack needs about 10 square feet. Thus, collocation sites expect at most 1000 watts per rack.

EXAMPLE Using the information in Figure 8.38, calculate the total cost of ownership for three years: purchase prices, operator costs, and maintenance costs.

ANSWER Figure 8.39 shows the total cost of ownership for the six clusters. To keep things simple, we assume each system with local disks needs a full-time operator, but the clusters that access their disks over a SAN with RAID need only a half-time operator. Thus, operator cost is 3 x $100,000 = $300,000 or 3 x $50,000 = $150,000.

For backup, let's assume we need enough tapes to store 2 TB for a full dump. We need four sets for the weekly dumps plus six more sets so that we can have a six-month archive. Tape units normally compress their data to get a factor of two in density, so we'll assume compression successfully turns 40 GB tapes into 80 GB tapes. The cost of these tapes is:

$$10 \times \frac{2000\ \text{GB}}{80\ \text{GB/tape}} \times \$70 = 10 \times 25 \times \$70 = \$17{,}500$$

The daily backups depend on the amount of data changed.
If 2 tapes per day are sufficient (up to 8% changes per day), we need to spend another

$$7 \times 2 \times \$70 = 14 \times \$70 = \$980$$

The figure lists maintenance costs for the computers and the LAN. The disks come with a 3-year warranty, so there is no extra maintenance cost for them. The cost per rack of rental space for three years is 3 x 12 x $1000 or $36,000.

Figure 8.39 shows the 2-way SMP using SAN is the winner. Note that hardware costs are only a half to a third of the cost of ownership. Over three years the operator costs can be more than the cost of purchase of the hardware, so reducing those costs significantly reduces total cost of ownership. Our results depend on some critical assumptions, but surveys of the total cost of ownership for items with storage go up to factors of five to ten over purchase price.

Price (thousands) | 1-way | 2-way | 8-way | 1-way SAN | 2-way SAN | 8-way SAN
HW purchase | $180 | $161 | $253 | $281 | $230 | $289
HW maintenance | $49 | $47 | $50 | $51 | $49 | $51
SW costs | $26 | $13 | $13 | $26 | $13 | $13
Space rental | $36 | $36 | $72 | $72 | $36 | $72
Operator | $300 | $300 | $300 | $150 | $150 | $150
Backup tapes | $18 | $18 | $18 | $18 | $18 | $18
Total | $609 | $576 | $707 | $598 | $497 | $594

FIGURE 8.39 Total cost of ownership for three years for clusters in Figures 8.35 and 8.37. Operator costs are as significant as purchase price, and hence the assumption that SAN halves operator costs is very significant.
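The recurring-cost arithmetic in the third example is simple enough to restate as a small Python sketch. It uses only the assumptions given in the answer (80 GB per compressed tape at $70, ten full dumps of 2 TB, two tapes a day for a week of incrementals, $100,000 per operator-year, and $1000 per rack per month); the names are hypothetical.

```python
import math

TAPE_PRICE = 70          # dollars per DLT tape
TAPE_GB = 80             # capacity per tape with compression
FULL_DUMP_GB = 2000      # one full dump stores 2 TB
FULL_DUMP_SETS = 10      # four weekly sets plus six monthly sets
DAILY_TAPES = 2          # assumed sufficient for incremental dumps
DAYS_PER_WEEK = 7

tapes_per_full_dump = math.ceil(FULL_DUMP_GB / TAPE_GB)
full_dump_cost = FULL_DUMP_SETS * tapes_per_full_dump * TAPE_PRICE
incremental_cost = DAYS_PER_WEEK * DAILY_TAPES * TAPE_PRICE
tape_cost = full_dump_cost + incremental_cost


def operator_cost(fraction_of_operator, years=3, yearly_cost=100_000):
    """Operator cost over the ownership period, in dollars."""
    return int(years * yearly_cost * fraction_of_operator)


def rack_rental(racks, monthly_rate=1000, years=3):
    """Collocation rent over the ownership period, in dollars."""
    return racks * monthly_rate * 12 * years


print("tapes:", tape_cost)
print("operator, local disks:", operator_cost(1.0))
print("operator, SAN:", operator_cost(0.5))
print("rent, one rack:", rack_rental(1), "two racks:", rack_rental(2))
```

The tape total of a little over $18,000 matches the backup-tape entry of Figure 8.39, and the two operator values are the ones that dominate the comparison between the local-disk and SAN clusters.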

Fourth Example: Cost and Performance of a Cluster for Transaction Processing

The August 2001 TPC-C report includes a cluster built from building blocks similar to the examples above. This cluster also has 32 processors, uses the same IBM computers as building blocks, and uses the same switch to connect computers together. Figure 8.40 shows its organization. It achieves 121,319 queries per hour for $2.2 million. Here are the key differences:

* Disk size: Since TPC-C cares more about I/Os per second (IOPS) than disk capacity, this cluster uses many small, fast disks. The use of small disks gives many more IOPS for the same capacity. These disks also rotate faster than the 10,000 RPM disks of the earlier examples, delivering more IOPS per disk. The 9.1-GB disk costs $405 and the 18.2-GB disk costs $549, or an increase in dollars per GB of a factor of 1.7 to 2.5. The totals are ... 9.1-GB disks and ... 18.2-GB disks, yielding a total capacity of 8 TB. (Presumably the reason for the mix of sizes is to get sufficient capacity and IOPS to run the benchmark.) These 720 disks need 720/14 or 52 enclosures, which is 13 enclosures per computer. In contrast, earlier 8-way clusters achieved 2 TB with 32 disks, as we cared more about cost per GB than IOPS.

* RAID: Since the TPC-C benchmark does not factor in human costs for running a system, there is little incentive to use a SAN. TPC-C does require RAID protection of disks, however. IBM used a RAID product that plugs into a PCI card and provides four SCSI strings. To get higher availability and performance, each enclosure attaches to two SCSI buses. Thus, there are 52 x 2 or 104 SCSI cables attached to the 28 RAID controllers, which support up to 28 x 4 or 112 strings.

* Memory: Conventional wisdom for TPC-C is to pack as much DRAM as possible into the servers. Hence, each of the four 8-way SMPs is stuffed with the maximum of 32 GB, yielding a total of 128 GB.

* Processor: This benchmark uses the 900 MHz Pentium III with a 2 MB L2 cache. The price is $6599, compared to $1799 for the 700 MHz Pentium III with a 1 MB L2 cache in the prior 8-way clusters.
* PCI slots: This cluster uses 7 of the 12 available PCI bus slots for the RAID controllers, compared to 1 PCI bus slot for an external SCSI or FC-AL controller in the prior 8-way clusters. This greater utilization follows the guideline of trying to use all resources of a large SMP.

* Tape reader, monitor, uninterruptible power supply: To make the system easier to bring up and to keep running for the benchmark, IBM includes one DLT tape reader, four monitors, and four UPSs.

[Figure: four 8-way servers attached to a 1-Gigabit Ethernet switch that also serves the TPC-C clients; each server holds seven RAID controllers (R) connected to its storage enclosures, with 5 enclosures per server not shown.]

FIGURE 8.40 IBM cluster for TPC-C. This cluster has 32 Pentium III processors, each running at 900 MHz with a 2 MB L2 cache. The total of DRAM memory is 128 GB. Seven PCI slots in each computer contain RAID controllers (R for RAID), and each has four Ultra160 SCSI strings. These strings connect to 13 storage enclosures per computer, giving 52 total. Each enclosure has 14 SCSI disks, either 9.1 GB or 18.2 GB. The total is ... 9.1 GB disks and ... 18.2 GB disks. There are also two 9.1 GB disks inside each computer that are used for paging and rebooting.

* Maintenance and spares: TPC-C allows use of spares to reduce maintenance costs, which is a minimum of two spares or 10% of the items. Hence, there are two spare Ethernet switches, host adapters, and cables for TPC-C.

Figure 8.41 compares the 8-way cluster from before to this TPC-C cluster. Note that almost half of the cost is in software, installation, and maintenance for the TPC-C cluster. At the time of this writing, the computer with the fastest TPC-C result basically scales this cluster from 4 to 35 xSeries 370 servers and uses bigger Ethernet switches.

Component | 8-way SAN cluster | % | TPC-C cluster | %
4 systems (700 MHz/1 MB v. 900 MHz/2 MB) | $58 | 17% | $76 | 3%
28 extra processors (700 MHz/1 MB v. 900 MHz/2 MB) | $55 | 16% | $190 | 8%
Extra memory (8 GB v. 32 GB) | $71 | 20% | $306 | 14%
Disk drives (2 TB/73.4 GB v. 8 TB/9.1, 18.2 GB) | $54 | 15% | $316 | 14%
Disk enclosures (3 v. 52) | $11 | 3% | $165 | 7%
RAID controller (1 v. 28) | $16 | 4% | $69 | 3%
LAN network (1 switch/4 HBAs v. 3 switches/6 HBAs) | $10 | 3% | $24 | 1%
SAN network (4 NICs, cables v. 0) | $10 | 3% | n.a. | 0%
Software (Windows v. Windows + SQL Server + installation) | $13 | 4% | $951 | 42%
Maintenance + hardware setup costs | $51 | 14% | $115 | 5%
Racks, UPS, backup (2 racks v. 7 racks + 4 UPS + 1 tape unit) | $3 | 1% | $40 | 2%
Total | $... | 100% | $2,... | 100%

FIGURE 8.41 Comparing 8-way SAN cluster and TPC-C cluster in price (in $1000) and percentage. The higher cost of the system and extra processors is due to using the faster chips with the larger caches. Memory costs are higher due to more total memory and using the more expensive 1 GB DIMMs. The increased disk costs and disk enclosure costs are due to higher capacity and using smaller drives. Software costs increase due to adding the SQL Server database plus IBM charges for software installation of this cluster. Similarly, although hardware maintenance costs are close, IBM charged to set up seven racks of hardware, whereas we assumed the customer assembled two racks of hardware for free. Finally, LAN costs are higher due to the TPC-C policy of buying spares to lower maintenance costs.
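The disk-subsystem counts quoted for the TPC-C cluster follow from a handful of ratios, sketched below in Python. The constants come from the key differences and Figure 8.40 (720 disks, 14 disks per enclosure, 4 computers, 7 RAID controllers per computer, 4 SCSI strings per controller, 2 buses per enclosure) and the disk prices from the text and Figure 8.33; the variable names are hypothetical.

```python
import math

DISKS = 720
DISKS_PER_ENCLOSURE = 14
COMPUTERS = 4
RAID_CARDS_PER_COMPUTER = 7
STRINGS_PER_RAID_CARD = 4
BUSES_PER_ENCLOSURE = 2

enclosures = math.ceil(DISKS / DISKS_PER_ENCLOSURE)
enclosures_per_computer = enclosures // COMPUTERS
scsi_cables = enclosures * BUSES_PER_ENCLOSURE
raid_controllers = COMPUTERS * RAID_CARDS_PER_COMPUTER
available_strings = raid_controllers * STRINGS_PER_RAID_CARD

# Dollars per GB for the small disks relative to the 73.4 GB, $1,299 disk.
baseline = 1299 / 73.4
ratio_9gb = (405 / 9.1) / baseline
ratio_18gb = (549 / 18.2) / baseline

print(enclosures, enclosures_per_computer, scsi_cables)
print(raid_controllers, available_strings)
print(round(ratio_18gb, 1), round(ratio_9gb, 1))
```

With 112 strings available and 104 in use, the controllers have a few strings to spare, and the price ratios reproduce the factor of 1.7 to 2.5 quoted for dollars per GB.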
Summary of Examples

With completion of the cluster tour, you've seen a variety of cluster designs, including one representative of the state-of-the-art cost-performance cluster in 2001. Note that we concentrated on cost in constructing these clusters, but only book length prevents us from evaluating the performance and availability bottlenecks in these designs. Given the similarity to performance analysis of storage systems in the last chapter, we leave that to the reader in the exercises.

Having completed the tour of cluster examples, a few things stand out. First, the cost of purchase is less than half the cost of ownership. Thus, inventions that help only with hardware costs can solve only a part of the problem. For example, despite the higher costs of SANs, they may lower cost of ownership sufficiently to justify the investment. Second, the smaller computers are generally cheaper and faster for a given function compared to the larger computers. In this case, the larger cache required to allow several processors to share a bus means a much larger die, which increases cost and limits clock rate. Third, space and power matter for both ends of the computing spectrum: clusters at the high end and embedded computers at the low end.

8.12 Putting It All Together: The Google Cluster of PCs

Figure 8.42 shows the rapid growth of the World Wide Web and the corresponding demand for searching it. The number of pages indexed grew by a factor of 1000 between 1994 and 1997, but people were still only interested in the top 10 answers, which was a problem for search engines. In 1997, only one quarter of the search engines would find themselves in their own top 10 results.

Date | WWW pages indexed (million) | Queries per day (million) | Search engine
April ... | ... | ... | World Wide Web Worm
November ... | ... | ... | Alta Vista
December ... | ... | ... | Google

FIGURE 8.42 Growth in pages indexed and search queries performed by several search engines. [Brin and Page, 1998] Searches have been growing about 20% per month at Google, or about 8.9 times per year. Most of the 1.3 billion pages are fully indexed and cached at Google. Google also indexes pages based only on the URLs in cached and indexed pages, so about 40% of the 1.3 billion are just URLs without cached copies of the page at Google.

[Figure: Google search page for the query "Hennessy Patterson", with the buttons "Google Search" and "I'm Feeling Lucky". Searched the web for Hennessy Patterson. Results 1-10 of about 13,300. Search took 0.23 seconds. First result: Computer Architecture: A Quantitative Approach ... on currently predominant and emerging commercial systems, the Hennessy and Patterson have prepared entirely new chapters covering additional advanced topics: ... - Cached - Similar pages]

FIGURE 8.43 First entry in result of a search for Hennessy Patterson. Note that the search took less than 1/4 second, that it includes a capsule summary of the contents from the WWW page at Morgan Kaufmann, and that it offers you the choice to either follow the actual URL or just read the cached copy of the page (Cached) stored in the Google cluster.

Google was designed first to be a search engine that could scale at that growth rate. In addition to keeping up with the demand, Google improved the relevance of the top results produced so that the user would likely get the desired result. For example, Figure 8.43 shows the first Google result for the query "Hennessy Patterson", which from your authors' perspective is the right answer. Techniques to improve search relevance include ranking pages by popularity, examining the text at the anchor sites of the URLs, and proximity of keyword text within a page.

Search engines also have a major reliability requirement, as people are using them at all times of the day and from all over the world. Google must essentially be continuously available.

Since a search engine is normally interacting with a person, its latency must not exceed its users' patience. Google's goal is that no search takes more than 0.5 seconds, including network delays. As the figures above show, bandwidth is also vital. In 2000, Google served an average of almost 1000 queries per second as well as searched and indexed more than a billion pages.

In addition, a search engine must crawl the WWW regularly to have up-to-date information to search. Google crawls the entire WWW and updates its index every 4 weeks, so that every WWW page is visited once a month. Google also keeps a local copy of the text of most pages so that it can provide the snippet text as well as offer a cached copy of the page, as shown in Figure 8.43.

Description of the Google Infrastructure

To keep up with such demand, in December 2000 Google uses more than 6000 processors and thousands of disks, giving Google a total of about one petabyte of disk storage. At the time, the Google site was likely the single system with the largest storage capacity in the private sector. Rather than achieving availability by using RAID storage, Google relies on redundant sites, each with thousands of disks and processors: two sites are in Silicon Valley and one in Virginia.
The search index, which is a small number of terabytes, plus the repository of cached pages, which is on the order of the same size, are replicated across the three sites. Thus, if a single site fails, there are still two more that can retain the service. In addition, the index and repository are replicated within a site to help share the workload as well as to continue to provide service within a site even if components fail.

Each site is connected to the Internet via OC48 (2488 Mbits/sec) links of the collocation site. To protect against failure of the collocation link, there is a separate OC12 link connecting the two Silicon Valley sites so that in an emergency both sites can use the Internet link at one site. The external link is unlikely to fail at both sites since different network providers supply the OC48 lines. (The Virginia site now has a sister site so as to provide the same benefits.)

Figure 8.44 shows the floor plan of a typical site. The OC48 link connects to two Foundry BigIron 8000 switches via a large Cisco switch. Note that this link is also connected to the rest of the servers in the site. These two switches are redundant so that a switch failure does not disconnect the site. There is also an OC12 link from the Foundry switches to the sister site for emergencies. Each switch can connect to 128 1-Gbit/sec Ethernet lines.

[Figure: floor plan showing rows of racks between two Foundry switches ("Fnd swtch"), each switch with an OC12 and an OC48 link.]

FIGURE 8.44 Floor plan of a Google cluster, from a God's eye view. There are 40 racks, each connected via 4 copper Gbit Ethernet links to 2 redundant Foundry 128 by 128 switches ("Fnd swtch"). Figure 8.45 shows a rack contains 80 PCs, so this facility has about 3200 PCs. (For clarity, the links are only shown for the top and bottom rack in each row.) These racks are on a raised floor so that the cables can be hidden and protected. Each Foundry switch in turn is connected to the collocation site network via an OC48 (2.4 Gbit) link to the Internet. There are two Foundry switches so that the cluster is still connected even if one switch fails. There is also a separate OC12 (622 Mbit) link to a separate nearby collocation site in case the OC48 network of one collocation site fails; it can still serve traffic over the OC12 to the other site's network. Each Foundry switch can handle 128 1-Gbit Ethernet lines and each rack has 2 1-Gbit Ethernet lines per switch, so the maximum number of racks for the site is 64. The two racks near the Foundry switches contain a few PCs to act as front ends and help with tasks such as html service, load balancing, monitoring, and UPS to keep the switch and fronts up in case of a short power failure. It would seem that a facility that has redundant diesel engines to provide independent power for the whole site would make UPS redundant. A survey of data center users suggests power failures still happen yearly.
Racks of PCs, each with 4 1-Gbit/sec Ethernet interfaces, are connected to the 2 Foundry switches. Thus, a single site can support 2 x 128 / 4 or 64 racks of PCs.

Figure 8.45 shows Google's rack of PCs. Google uses PCs that are only 1 VME rack unit. To connect these PCs to the Foundry switches, it uses an HP Ethernet switch. It is 4 RU high, leaving room in the rack for 40 PCs. This switch has modular network interfaces, which are organized as removable blades. Each blade can contain 8 100-Mbits/s Ethernet interfaces or a single 1-Gbit Ethernet interface. Thus, 5 blades are used to connect 100-Mbits/s Cat5 cables to each of the 40 PCs in the rack, and 2 blades are used to connect 1-Gbit/sec copper cables to the two Foundry switches.

As Figure 8.45 shows, to pack even more PCs in a rack Google uses the same configuration in the front and back of the rack, yielding 80 PCs and 2 switches per rack. There is about a 3-inch gap in the middle between the columns of PCs for the hot air to exit, which is drawn out of the "chimney" via exhaust fans at the top of the rack.
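The port counting behind these limits can be restated as a short Python sketch. It assumes the figures given in the description: 2 Foundry switches with 128 Gbit ports each, 4 Gbit links per rack, 40 PCs per rack side, 8 100-Mbit interfaces per blade, and 40 racks at the site. The names are hypothetical.

```python
import math

FOUNDRY_SWITCHES = 2
PORTS_PER_FOUNDRY_SWITCH = 128   # 1-Gbit Ethernet lines per switch
GBIT_LINKS_PER_RACK = 4          # 2 links to each Foundry switch
PCS_PER_SIDE = 40                # front or back of a rack
SIDES_PER_RACK = 2
PORTS_PER_100MBIT_BLADE = 8
UPLINK_BLADES_PER_SWITCH = 2     # one 1-Gbit blade per Foundry switch
RACKS_AT_SITE = 40

max_racks = FOUNDRY_SWITCHES * PORTS_PER_FOUNDRY_SWITCH // GBIT_LINKS_PER_RACK
pc_blades = math.ceil(PCS_PER_SIDE / PORTS_PER_100MBIT_BLADE)
blades_per_rack_switch = pc_blades + UPLINK_BLADES_PER_SWITCH
pcs_per_rack = PCS_PER_SIDE * SIDES_PER_RACK
pcs_at_site = RACKS_AT_SITE * pcs_per_rack

print("maximum racks per site:", max_racks)
print("blades per rack switch:", blades_per_rack_switch)
print("PCs per rack:", pcs_per_rack, "PCs at a 40-rack site:", pcs_at_site)
```

The 40-rack site therefore uses well under the 64 racks that the two Foundry switches could connect, leaving room for growth.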

[Figure panels: front view (also back view) of the 19-inch rack, side view, and close-up view of 1 RU PCs.]

FIGURE 8.45 Front view, side view, and close-up of a rack of PCs used by Google. The photograph on the left shows the HP ProCurve 4000 Ethernet switch in the middle, with 20 PCs above and 20 PCs below. Each PC connects via a Cat5 cable on the left side to the switch in the middle, running 100 Mbit Ethernet. Each blade of the switch can hold 8 100-Mbit Ethernet interfaces or 1 1-Gbit interface. There are also two 1-Gbit Ethernet links leaving the switch on the right. Thus, each PC has only 2 cables: 1 Ethernet and 1 power cord. The far right of the photo shows a power strip, with each of the 40 PCs and the switch connected to it. Each PC is 1 VME rack unit (RU) high. The switch in the middle is 4 RU high. The photo in the middle is a close-up of a rack, showing the contents of a 1 RU PC. This unit contains 2 Maxtor DiamondMax 5400 RPM IDE drives on the right of the box, 256 MB of 100 MHz SDRAM, a PC motherboard, a single power supply, and an Intel microprocessor. Each PC runs one of two versions of the Linux kernel on a slightly modified Red Hat release. Between March 2000 and November 2000, over the period the Google site was populated, the microprocessor varied in performance from a 533 MHz Celeron to an 800 MHz Pentium III. The goal was selecting good cost-performance, which was often close to $200 per chip. Disk capacity varied from 40 to 80 GB. You can see the Ethernet cables on the left, power cords on the right, and Ethernet cables connected to the switch at the top of the figure. In December 2000 the unassembled parts costs are about $500 for the two drives, $200 for the microprocessor, $100 for the motherboard, and $100 for the DRAM. Including the enclosure, power supply, fans, cabling, and so on, an assembled PC might cost $1300 to $1700. The drawing on the right shows that PCs are kept in two columns, front and back, so that a single rack holds 80 PCs and 2 switches. The typical power per PC is about 55 watts and about 70 watts per switch, so a rack uses about 4500 watts. Heat is exhausted into a 3-inch vent between the two columns, and the hot air is drawn out the top using fans. (The drawing shows 22 PCs per side, each 2 RU high, instead of the Google configuration of 40 1 RU PCs plus a switch per side.) (Photos and figure from Rackable Systems.)

The PC itself is fairly standard: 2 Maxtor ATA/IDE drives, 256 MB of SDRAM, a modest Intel microprocessor, a PC motherboard, one power supply, and a few fans. Each PC runs the Linux operating system. To get the best value per dollar, every 2-3 months Google increases the capacity of the drives or the speed of the processor. Thus, the 40-rack site shown above, which was populated between March and November 2000, has microprocessors that range from a 533 MHz Celeron to an 800 MHz Pentium III, disks that vary in capacity between 40 and 80 GB and in speed from 5400 to 7200 RPM, and memory bus speed that is either 100 or 133 MHz.

Performance

Each collocation site connects to the Internet via OC48 (2488 Mbits/sec) links, which are shared by Google and the other Internet service providers. If a typical response to a query is, say, 4000 bytes, then the average bandwidth demand is

$$\frac{70{,}000{,}000\ \text{queries/day} \times 4000\ \text{B/query} \times 8\ \text{bits/B}}{24 \times 60 \times 60\ \text{seconds/day}} = \frac{2{,}240{,}000\ \text{Mbits}}{86{,}400\ \text{seconds}} \approx 26\ \text{Mbits/s}$$

which is just 1% of the link speed of each site. Even if we multiply by a factor of 4 to account for peak versus average demand and requests as well as responses, Google needs little of that bandwidth.

Crawling the web and updating the sites needs much more bandwidth than serving the queries. Let's estimate some parameters to put things into perspective. Assume that it takes 7 days to crawl a billion pages:

$$\frac{1{,}000{,}000{,}000\ \text{pages} \times 4000\ \text{B/page} \times 8\ \text{bits/B}}{24 \times 60 \times 60\ \text{seconds/day} \times 7\ \text{days}} = \frac{32{,}000{,}000\ \text{Mbits}}{604{,}800\ \text{seconds}} \approx 53\ \text{Mbits/s}$$

This data is collected at a single site, but the final multi-terabyte index and repository must then be replicated at the other two sites. If we assume we have 7 days to replicate the data and that we are shipping, say, 5 terabytes from one site to two sites, then the average bandwidth demand is

$$\frac{5{,}000{,}000\ \text{MB} \times 8\ \text{bits/B} \times 2}{24 \times 60 \times 60\ \text{seconds/day} \times 7\ \text{days}} = \frac{80{,}000{,}000\ \text{Mbits}}{604{,}800\ \text{seconds}} \approx 132\ \text{Mbits/s}$$

Hence, the machine-to-person bandwidth is relatively trivial, with the real bandwidth demand being machine to machine.
Moreover, Google's search rate is growing 20% per month, and the number of pages indexed has more than doubled every year since 1997, so bandwidth must be available for growth.

Time of flight for messages across the United States takes about 0.1 seconds, so it's important for Europe to be served from the Virginia site and for California to be served by the Silicon Valley sites. To try to achieve the goal of 1/2 second latency, Google software normally guesses where the search is from in order to reduce time-of-flight delays.
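The three bandwidth estimates in this section, and the compounding of 20% monthly growth, can be reproduced with the Python sketch below. It uses the same assumed inputs as the text (70 million queries per day, 4000 bytes per response or page, a 7-day crawl of a billion pages, and 5 terabytes shipped to two sites in 7 days); the helper name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
OC48_MBITS_PER_S = 2488


def average_mbits_per_s(total_bytes, days):
    """Average rate needed to move total_bytes in the given number of days."""
    total_bits = total_bytes * 8
    return total_bits / (days * SECONDS_PER_DAY) / 1_000_000


# Serving queries: 70 million responses of 4000 bytes per day.
query_rate = average_mbits_per_s(70_000_000 * 4000, days=1)
# Crawling: a billion pages of 4000 bytes in 7 days.
crawl_rate = average_mbits_per_s(1_000_000_000 * 4000, days=7)
# Replication: 5 terabytes shipped to each of two sites in 7 days.
replication_rate = average_mbits_per_s(5_000_000_000_000 * 2, days=7)

link_fraction = query_rate / OC48_MBITS_PER_S
yearly_growth = 1.20 ** 12   # 20% per month compounded over a year

print("queries: %.1f Mbits/s (%.1f%% of OC48)" % (query_rate, 100 * link_fraction))
print("crawl: %.1f Mbits/s" % crawl_rate)
print("replication: %.1f Mbits/s" % replication_rate)
print("growth per year: %.1fx" % yearly_growth)
```

Under these assumptions replication needs roughly five times the bandwidth of serving queries, which is the point of the comparison between machine-to-person and machine-to-machine traffic.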

Cost

Given that the basic building block of the Google cluster is a PC, the capital cost of a site is typically a function of the cost of a PC. Rather than buy the latest microprocessor, Google looks for the best cost-performance. Thus, in March 2000 an 800 MHz Pentium III cost about $800, while a 533 MHz Celeron cost under $200, and the difference in performance couldn't justify the extra $600 per machine. (When you purchase PCs by the thousands, every $100 per PC is important.) By November the price of the 800 MHz Pentium III had dropped to $200, so it was a better investment. When accounting for this careful buying plus the enclosures and power supplies, your authors estimate the PC cost was $1300 to $1700.

The switches cost about $1500 for the HP Ethernet switch and about $100,000 each for the Foundry switches. If the racks themselves cost about $1000 to $2000 each, the total capital cost of a 40-rack site is about $4.5 million to $6.0 million. Counting the 3200 microprocessors and 0.8 terabytes of DRAM as part of that total, the disk storage costs about $10,000 to $15,000 per terabyte. To put this into perspective, the leading performer for the TPC-C database benchmark in August 2001 is a scaled-up version of the cluster from the last example. The hardware alone costs about $10.8 million, which includes 280 microprocessors, 0.5 terabytes of DRAM, and 116 terabytes of SCSI disks organized as RAID 1. Ignoring the RAID 1 overhead, disk storage costs about $93,000 per terabyte, about a factor of 8 higher than Google despite having 1/8 the number of processors and about 5/8 the DRAM.

The Google rack with 80 PCs, with each PC operating at about 55 watts, uses 4500 watts in 10 square feet. That is considerably higher than the 1000 watts per rack expected by the collocation sites. Each Google rack also uses 60 amps. As mentioned above, reducing power per PC is a major opportunity for the future of such clusters, especially as the cost per kilowatt-hour is increasing and the cost per Mbits/second is decreasing.
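As a rough cross-check of the capital cost range, the sketch below adds up the quoted component prices for one site. Two inputs are assumptions rather than figures from this section: two HP Ethernet switches per rack and two Foundry switches per site. The function name is likewise illustrative.

```python
PCS_PER_RACK = 80
RACKS_PER_SITE = 40
HP_SWITCH_COST = 1_500
FOUNDRY_SWITCH_COST = 100_000
HP_SWITCHES_PER_RACK = 2        # assumption, not stated in this section
FOUNDRY_SWITCHES_PER_SITE = 2   # assumption, not stated in this section


def site_capital_cost(pc_cost, rack_cost):
    """Sum the PC, rack, and switch costs for one 40-rack site, in dollars."""
    pcs = PCS_PER_RACK * RACKS_PER_SITE * pc_cost
    racks = RACKS_PER_SITE * rack_cost
    hp_switches = RACKS_PER_SITE * HP_SWITCHES_PER_RACK * HP_SWITCH_COST
    foundry_switches = FOUNDRY_SWITCHES_PER_SITE * FOUNDRY_SWITCH_COST
    return pcs + racks + hp_switches + foundry_switches


low = site_capital_cost(pc_cost=1_300, rack_cost=1_000)
high = site_capital_cost(pc_cost=1_700, rack_cost=2_000)
rack_watts = PCS_PER_RACK * 55  # each PC draws about 55 watts

print(f"site cost: ${low / 1e6:.2f} million to ${high / 1e6:.2f} million")
print(f"power per rack: {rack_watts} watts")
```

Under these assumptions the total lands close to the quoted range, and the PCs account for more than 90% of it, which is why the per-PC price gets so much attention.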
Reliability

Not surprisingly, the biggest source of failures in the Google PC is software. On an average day, about 20 machines will be rebooted, and that normally solves the problem. To reduce the number of cables per PC as well as cost, Google has no ability to remotely reboot a machine. The software stops giving work to a machine when it observes unusual behavior, the operator calls the collocation site and tells them the location of the machine that needs to be rebooted, and a person at the site finds the label and pushes the switch on the front panel. Occasionally the person hits the wrong switch, either by mistake or due to mislabeling on the outside of the box.

The next reliability problem is the hardware, which has about 1/10th the failures of software. Typically, about 2% to 3% of the PCs need to be replaced per year, with failures due to disks and DRAM accounting for 95% of these failures. The remaining 5% are due to problems with the motherboard, power supply, connectors, and so on. The microprocessors themselves never seem to fail.

The DRAM failures are perhaps a third of the failures. Google sees errors both from bits changing inside DRAM and when bits transfer over the 100 to 133 MHz bus. There was no ECC protection available on PC desktop motherboard chip sets in 2000, so it was not used. The DRAM is determined to be the problem when Linux cannot be installed with a proper checksum until the DRAM is replaced. As PC motherboard chip sets with ECC become available, Google plans to start using ECC both to correct some failures and, more importantly, to make it easier to see when DRAMs fail. The extra cost of the ECC is trivial given the wide fluctuation in DRAM prices: careful purchasing procedures are more important than whether or not the DIMM has ECC.

Disks are the remaining PC failures. In addition to the standard failures that result in a message to the error log on the console, in almost equal numbers these disks will occasionally result in a performance failure, with no error message to the log. Instead of delivering normal read bandwidths of 28 Mbytes/second, disks will suddenly drop to 4 Mbytes/second or even 0.8 Mbytes/second. As the disks are under warranty for 5 years, Google sends the disks back to the manufacturer for either operational or performance failures to get replacements. Thus, there has been no exploration of the reason for the disk anomalies.

When a PC has problems, it is reconfigured out of the system, and about once a week a person removes the broken PCs. They are usually repaired and then reinserted into the rack.

In regard to the switches, over a 2-year period perhaps 200 of the HP Ethernet switches were deployed, and 2 to 3 have failed. None of the six Foundry switches has failed in the field, although some have had problems on delivery. These switches have a blade-based design with 16 blades per switch, and 2 to 3 of the blades have failed.

The final issue is collocation reliability. The experience of many Internet service providers is that once a year there will be a power outage that affects either the whole site or a major fraction of a site.
On average, there is also a network outage that disconnects the whole site from the Internet. These outages can last for hours.

Collocation site reliability also appears to follow a bathtub curve: high failure rates in the beginning, which quickly fall to low rates in the middle, and then rise to high rates at the end. When they are new, the sites are empty and so are continuously filled with new equipment. With more people and new equipment being installed, there is a higher outage rate. Once the site is full of equipment, there are fewer people around and less change, so the site has a low failure rate. Once the equipment becomes outdated and starts being replaced, the activity in the site increases and so does the failure rate. Thus, the failure rate of a site depends in part on its age, just as the classic bathtub reliability curves would predict. It is also a function of the people, and if there is a turnover in people, the fault rate can change.

Google accommodates collocation unreliability by having multiple sites with different network providers, plus leased lines between pairs of sites for emergencies. Power failures, network outages, and so on do not affect the availability of the Google service.

8.13 Another View: Inside a Cell Phone

In 1999, there were 76 million cellular subscribers in the United States, a 25% growth from the year before. That growth rate is almost 35% per year worldwide, as developing countries find it much cheaper to install cellular towers than copper-wire-based infrastructure. Thus, in many countries, the number of cell phones in use exceeds the number of wired phones in use.

Not surprisingly, the cellular handset market is growing at 35% per year, with about 280 million cellular phone handsets sold. To put that in perspective, in the same year sales of personal computers were 120 million. These numbers mean that tremendous engineering resources are available to improve cell phones, and cell phones are probably leaders in engineering innovation per cubic inch [Grice, 2000].

Before unveiling the anatomy of a cell phone, let's try a short introduction to wireless technology.

Background on Wireless Networks

Networks can be created out of thin air as well as out of copper and glass, creating wireless networks. Much of this section is based on a report from the National Research Council [1997].

A radio wave is an electromagnetic wave propagated by an antenna. Radio waves are modulated, which means that the sound signal is superimposed on the stronger radio wave that carries the sound signal, and which hence is called the carrier signal. Radio waves have a particular wavelength or frequency: they are measured either as the length of the complete wave or as the number of waves per second. Long waves have low frequencies and short waves have high frequencies. FM radio stations transmit on the band of 88 MHz to 108 MHz using frequency modulation (FM) to record the sound signal. By tuning into different frequencies, a radio receiver can pick up a specific signal. In addition to AM and FM radio, other frequencies are reserved for citizen band radio, television, pagers, air traffic control radar, Global Positioning System, and so on.
In the United States, the Federal Communications Commission decides who gets to use frequencies and for what purpose.

The bit error rate (BER) of a wireless link is determined by the received signal power, noise due to interference caused by the receiver hardware, interference from other sources, and characteristics of the channel. Noise is typically proportional to the radio frequency bandwidth, and a key measure is the signal-to-noise ratio (SNR) required to achieve a given BER. Figure 8.46 lists more challenges for wireless communication.
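The inverse relationship between wavelength and frequency described above (long waves have low frequencies) is simply wavelength = c / frequency. The short sketch below applies it to the two edges of the FM band quoted in the text and to a 900 MHz carrier; the function name is an illustrative choice.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def wavelength_m(frequency_hz):
    """Wavelength in meters of a radio wave with the given frequency in hertz."""
    return SPEED_OF_LIGHT_M_PER_S / frequency_hz


fm_low = wavelength_m(88e6)     # bottom of the FM band
fm_high = wavelength_m(108e6)   # top of the FM band
cell_900 = wavelength_m(900e6)  # a 900 MHz carrier

print(f"88 MHz  -> {fm_low:.2f} m")
print(f"108 MHz -> {fm_high:.2f} m")
print(f"900 MHz -> {cell_900 * 100:.0f} cm")
```

FM wavelengths are around 3 meters, while a 900 MHz carrier has a wavelength of about a third of a meter.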

Challenge: Path loss
Description: Received power divided by transmitted power; the radio must overcome signal-to-noise ratio (SNR) of noise from interference. Path loss is exponential in distance, and depends on interference if it is above 100 meters.
Impact: With 1 watt transmit power, 1 GHz transmit frequency, and 1 Mbits/sec data rate at 10^-7 BER, distance between radios can be 728 meters in free space vs. 4 meters in a dense jungle.

Challenge: Shadow fading
Description: Received signal blocked by objects, buildings outdoors or walls indoors; increase power to improve received SNR. It depends on the number of objects and their dielectric properties.
Impact: If transmitter is moving, need to change transmit power to ensure received SNR in region.

Challenge: Multipath fading
Description: Interference between multiple versions of signal that arrive at different times, determined by time between fastest signal and slowest signal relative to signal bandwidth.
Impact: At 900 MHz transmit frequency, signal power changes every 30 cm.

Challenge: Interference
Description: Frequency reuse, adjacent channel, narrow band interference.
Impact: Requires filters, spread spectrum.

FIGURE 8.46 Challenges for wireless communication.

Typically, wireless communication is selected because the communicating devices are mobile or because wiring is inconvenient, which means the wireless network must rearrange itself dynamically. Such rearrangement makes routing more challenging. A second challenge is that wireless signals are not protected and hence are subject to mutual interference, especially as devices move. Power is another challenge for wireless communication, both because the devices tend to be battery powered and because antennas radiate power to communicate and little of it reaches the receiver. As a result, raw bit error rates are typically a thousand to a million times higher than for copper wire.

There are two primary architectures for wireless networks: base-station architectures and peer-to-peer architectures. Base stations are connected by land lines for longer-distance communication, and the mobile units communicate only with a single local base station.
Peer-to-peer architectures allow mobile units to communicate with each other, and messages hop from one unit to the next until delivered to the desired unit. Although peer-to-peer is more reconfigurable, base stations tend to be more reliable since there is only one hop between the device and the station.

Cellular telephony, the most popular example of wireless networks, relies on radio with base stations. Cellular systems exploit exponential path loss to reuse the same frequency at spatially separated locations, thereby greatly increasing the number of customers served. Cellular systems divide a city into nonoverlapping hexagonal cells that use different frequencies if nearby, reusing a frequency only when cells are far enough apart that mutual interference is acceptable.

At the intersection of three hexagonal cells is a base station with transmitters and antennas. It is connected to a switching office, which coordinates handoffs when a mobile device leaves one cell and goes into another, and which accepts and places calls over land lines. Depending on topography, population, and so on, the radius of a typical cell is two to ten miles.
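One way to picture frequency reuse over hexagonal cells is a repeating pattern of frequency groups. The sketch below assumes a seven-group pattern, a common textbook choice that is not stated in this section, and labels cells with axial hexagon coordinates (q, r). The function names and the coordinate scheme are illustrative.

```python
REUSE_FACTOR = 7  # assumption: seven frequency groups, not stated in the text

# Offsets to the six neighbors of a hexagonal cell in axial (q, r) coordinates.
NEIGHBOR_OFFSETS = [(1, 0), (0, 1), (-1, 1), (-1, 0), (0, -1), (1, -1)]


def frequency_group(q, r):
    """Frequency group (0..6) used by the cell at axial coordinates (q, r)."""
    return (q + 5 * r) % REUSE_FACTOR


def hex_distance(q, r):
    """Number of cell-to-cell steps from the origin to the cell at (q, r)."""
    return max(abs(q), abs(r), abs(q + r))


# Find the nearest cells that reuse the frequency group of the origin cell.
reusers = [
    (q, r)
    for q in range(-4, 5)
    for r in range(-4, 5)
    if (q, r) != (0, 0) and frequency_group(q, r) == frequency_group(0, 0)
]
nearest_reuse = min(hex_distance(q, r) for q, r in reusers)
print("nearest cell reusing the same group is", nearest_reuse, "steps away")
```

With this labeling, no cell shares a frequency group with any of its six neighbors, and the nearest cell reusing a group is three steps away, which is the kind of spatial separation the text relies on.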

The Cell Phone

Figure 8.47 shows the components of a radio, which is the heart of a cell phone. Radio signals are first received by the antenna, then amplified, passed through a mixer, then filtered, demodulated, and finally decoded. The antenna acts as the interface between the medium through which radio waves travel and the electronics of the transmitter or receiver. Antennas can be designed to work best in particular directions, giving both transmission and reception directional properties. Modulation encodes information in the amplitude, phase, or frequency of the signal to increase its robustness under impaired conditions. Radio transmitters go through the same steps, just in the opposite order.

[Antenna -> RF amp -> Mixer -> Filter -> Demodulator -> Decoder]

FIGURE 8.47 A radio receiver consists of an antenna, radio frequency amplifier, mixer, filters, demodulator, and decoder. A mixer accepts two signal inputs and forms an output signal at the sum and difference frequencies. Filters select a narrower band of frequencies to pass on to the next stage. Modulation encodes information to make it more robust. Decoding turns signals into information. Depending on the application, all electrical components can be either analog or digital. For example, a car radio is all analog components, but a PC modem is all digital except for the amplifier. Today analog silicon chips are used for the RF amplifier and first mixer in cellular phones.

Originally, all components were analog, but over time most were replaced by digital components, requiring the radio signal to be converted from analog to digital. The desire for flexibility in the number of radio bands led to software routines replacing some of these functions in programmable chips, such as digital signal processors. Because such processors are typically found in mobile devices, emphasis is placed on performance per joule to extend battery life, performance per square millimeter of silicon to reduce size and cost, and bytes per task to reduce memory size.
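The statement that a mixer produces output at the sum and difference frequencies follows from a trigonometric identity, if we assume an ideal mixer that simply multiplies its two inputs. The sketch below checks the identity numerically; the two input frequencies are hypothetical values chosen only for illustration.

```python
import math


def mixer_output(t, f1_hz, f2_hz):
    """Ideal multiplying mixer (an assumption): the product of two cosine inputs."""
    return math.cos(2 * math.pi * f1_hz * t) * math.cos(2 * math.pi * f2_hz * t)


def sum_and_difference(t, f1_hz, f2_hz):
    """The same signal written as two components at f1 + f2 and f1 - f2."""
    high = math.cos(2 * math.pi * (f1_hz + f2_hz) * t)
    low = math.cos(2 * math.pi * (f1_hz - f2_hz) * t)
    return 0.5 * (high + low)


F_SIGNAL_HZ = 900e6      # hypothetical received carrier
F_OSCILLATOR_HZ = 855e6  # hypothetical local oscillator

# Compare the two forms at 1000 sample times, one nanosecond apart.
worst_error = max(
    abs(mixer_output(n * 1e-9, F_SIGNAL_HZ, F_OSCILLATOR_HZ)
        - sum_and_difference(n * 1e-9, F_SIGNAL_HZ, F_OSCILLATOR_HZ))
    for n in range(1000)
)
print(f"largest difference between the two forms: {worst_error:.2e}")
```

The filter stage that follows the mixer would then keep only one of the two components, here the 45 MHz difference frequency.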
Figure 8.48 shows the generic block diagram of the electronics of a cell phone handset, with the DSP performing the signal processing and the microcontroller handling the rest of the tasks. Cell phone handsets are basically mobile computers acting as a radio. They include standard I/O devices (keyboard and LCD display) plus a microphone, speaker, and antenna for wireless networking. Battery efficiency affects sales, both for standby power when waiting for a call and for minutes of speaking.

[Blocks: antenna, RF receiver (Rx), RF transmitter (Tx), DSP, speaker, microphone, microcontroller, display, keyboard]

FIGURE 8.48 Block diagram of a cell phone. The DSP performs the signal processing steps of Figure 8.47, and the microcontroller controls the user interface, battery management, and call setup. (Based on Figure 1.3 of Groe and Larson [2000].)

When a cell phone is turned on, the first task is to find a cell. It scans the full bandwidth to find the strongest signal, which it keeps doing every seven seconds or if the signal strength drops, as it is designed to work from moving vehicles. It then picks an unused radio channel. The local switching office registers the cell phone, records its phone number and electronic serial number, and assigns it a voice channel for the phone conversation. To be sure the cell phone got the right channel, the base station sends a special tone on it, which the cell phone sends back to acknowledge it. The cell phone times out after five seconds if it doesn't hear the supervisory tone, and starts the process all over again. The original base station makes a handoff request to the incoming base station as the signal strength drops off.

To achieve a two-way conversation over radio, frequency bands are set aside for each direction, forming a frequency pair or channel. The original cellular base stations transmitted in one band (called the forward path) and cell phones transmitted in another (called the reverse path), with a frequency gap between the two to keep them from interfering with each other. Cells might have had between 4 and 80 channels. Channels were divided into setup channels for call setup and voice channels that handle the data or voice traffic.

The communication is done digitally, just like a modem, at 9,600 bits/second. Since wireless is a lossy medium, especially from a moving vehicle, the handset sends each message five times. To preserve battery life, the original cell phones typically transmit at two signal strengths--0.6 watts and 3.0 watts--depending on the distance to the cell.
This relatively low power not only allows smaller batteries and thus smaller cell phones, it also aids frequency reuse, which is key to cellular telephony.

Figure 8.49 shows a circuit board from an Ericsson digital phone, with the components identified. Note that the board contains two processors. A Z-80 microcontroller is responsible for controlling the functions of the board, I/O with the keyboard and display, and coordinating with the base station. The DSP handles all signal compression and decompression. In addition there are dedicated chips for analog-to-digital and digital-to-analog conversion, amplifiers, power management, and RF interfaces.

FIGURE 8.49 Circuit card from an Ericsson cell phone. (From Brain [2000].)

In 2001, a cell phone has about 10 integrated circuits, including parts made in exotic technologies like gallium arsenide and silicon germanium as well as in standard CMOS. The economics and desire for flexibility will likely shrink this to a few chips, but it appears that a separate microcontroller and DSP will be found inside those chips, with code implementing many of the functions.

Cell Phone Standards and Evolution

Improved communication speeds for cellular phones were developed, with multiple standards. Code division multiple access (CDMA), as one popular example, uses a wider radio frequency band for a path than the original cellular phones, called AMPS for Advanced Mobile Phone Service, a mostly analog system. The wider frequency band makes the signal more difficult to block, and is called spread spectrum. Other standards are time division multiple access (TDMA) and global system for mobile communication (GSM). These second-generation standards (CDMA, GSM, and TDMA) are mostly digital.

The big difference for CDMA is that all callers share the same channel, which operates at a much higher rate, and the system then distinguishes the different calls by encoding each one uniquely. Each CDMA phone call starts at 9600 bits/second; it is then encoded and transmitted as equal-sized messages at 1.25 megabits/second. Rather than send each signal five times as in AMPS, each bit is stretched so that it takes eleven times the minimum frequency, thereby accommodating interference while still allowing successful transmission. The base station receives the messages and separates them into the separate 9600 bits/second streams for each call.

To enhance privacy, CDMA uses pseudo-random sequences from a set of 64 predefined codes. To synchronize the handset and base station so as to pick a common pseudo-random seed, CDMA relies on a clock from the Global Positioning System, which continuously transmits an accurate time signal. By carefully selecting the codes, the shared traffic sounds like random noise to the listener. Hence, as more users share a channel there is more noise, and the signal-to-noise ratio gradually degrades. Thus, the capacity of the CDMA system is a matter of taste, depending upon the sensitivity of the listener to background noise.

In addition, CDMA uses speech compression and varies the rate of data transferred depending on how much activity is going on in the call. Both these techniques preserve bandwidth, which allows for more calls per cell. CDMA must regulate power carefully so that signals near the cell tower do not overwhelm those from far away, with the goal that all signals reach the tower at about the same level. The side benefit is that CDMA handsets emit less power, which both helps battery life and increases capacity when users are close to the tower.

Thus, compared to AMPS, CDMA improves the capacity of a system by up to an order of magnitude, has better call quality, has better battery life, and enhances users' privacy.
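The idea of many calls sharing one channel and being separated by their codes can be sketched with a toy example. The code below uses short orthogonal Walsh codes as a stand-in for the predefined codes mentioned above; that choice, the code length of 8, and all function names are assumptions made for illustration, and the sketch ignores noise, power control, and synchronization.

```python
def walsh_codes(order):
    """Build 2**order mutually orthogonal +1/-1 codes (Sylvester construction)."""
    codes = [[1]]
    for _ in range(order):
        same = [row + row for row in codes]
        flipped = [row + [-chip for chip in row] for row in codes]
        codes = same + flipped
    return codes


def spread(bits, code):
    """Send the code for a 1 bit and the negated code for a 0 bit."""
    chips = []
    for bit in bits:
        sign = 1 if bit else -1
        chips.extend(sign * chip for chip in code)
    return chips


def despread(channel, code):
    """Correlate the shared channel against one code to recover that call's bits."""
    n = len(code)
    bits = []
    for start in range(0, len(channel), n):
        block = channel[start:start + n]
        correlation = sum(x * chip for x, chip in zip(block, code))
        bits.append(1 if correlation > 0 else 0)
    return bits


codes = walsh_codes(3)  # 8 codes of length 8; walsh_codes(6) would give 64
call_a = [1, 0, 1, 1]
call_b = [0, 0, 1, 0]

# Both calls transmit at once; the channel carries the sum of their signals.
channel = [a + b for a, b in zip(spread(call_a, codes[1]), spread(call_b, codes[2]))]

print(despread(channel, codes[1]))  # recovers call_a
print(despread(channel, codes[2]))  # recovers call_b
```

Because the two codes are orthogonal, correlating against one code cancels the other call entirely. With more users or imperfect codes the cancellation is only partial, which is the gradual rise in background noise the text describes.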
After considerable commercial turmoil, there is a new third-generation standard called International Mobile Telephony 2000 (IMT-2000), which is based primarily on two competing versions of CDMA and one TDMA. This standard may lead to cell phones that work anywhere in the world.

8.14 Fallacies and Pitfalls

Myths and hazards are widespread with interconnection networks. This section has just a few warnings, so proceed carefully.

Pitfall: Using bandwidth as the only measure of network performance.

Many network companies apparently believe that given sophisticated protocols like TCP/IP that maximize delivered bandwidth, there is only one figure of merit for networks. This may be true for some applications, such as video, where there is little interaction between the sender and the receiver. Many applications, however, are of a request-response nature, and so for every large message there must be one or more small messages. One example is NFS.

[Table columns: message size; number of messages; overhead (secs) for ATM and Ethernet; number of data bytes; transmission (secs) for ATM and Ethernet; total time (secs) for ATM and Ethernet.]

FIGURE 8.50 Total time on a 10 Mbit Ethernet and a 155 Mbit ATM, calculating the total overhead and transmission time separately. Note that the size of the headers needs to be added to the data bytes to calculate transmission time. The higher overhead of the software driver for ATM offsets the higher bandwidth of the network. These measurements were performed in 1994 using SPARCstation 10s, the Fore Systems SBA-200 ATM interface card, and the Fore Systems ASX-200 switch. (NFS measurements taken by Mike Dahlin of U.C. Berkeley.)

Figure 8.50 compares a shared 10 Mbits/second Ethernet LAN to a switched 155 Mbits/second ATM LAN for NFS traffic. Ethernet drivers were better tuned than the ATM drivers, such that 10 Mbits/s Ethernet was faster than 155 Mbits/s ATM for payloads of 512 bytes or less. Figure 8.50 shows the overhead time, transmission time, and total time to send all the NFS messages over Ethernet and ATM. The peak link speed of ATM is 15 times faster and the measured link speed for 8-KB messages is almost 9 times faster. Yet the higher overheads offset the benefits so that ATM would transmit NFS traffic only 1.2 times faster.
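The way Figure 8.50 splits total time into overhead and transmission can be sketched as a simple model. The per-message overheads and the header size below are hypothetical values chosen for illustration only; they are not the measured numbers from the figure. Only the link speeds come from the text.

```python
def message_time_s(payload_bytes, header_bytes, overhead_s, link_mbits_per_s):
    """Time for one message: fixed software overhead plus transmission time."""
    transmission = (payload_bytes + header_bytes) * 8 / (link_mbits_per_s * 1e6)
    return overhead_s + transmission


HEADER_BYTES = 64              # hypothetical header size
ETHERNET_OVERHEAD_S = 400e-6   # hypothetical per-message driver overhead
ATM_OVERHEAD_S = 850e-6        # hypothetical per-message driver overhead


def ethernet_time_s(payload_bytes):
    """Per-message time on the 10 Mbits/second Ethernet."""
    return message_time_s(payload_bytes, HEADER_BYTES, ETHERNET_OVERHEAD_S, 10)


def atm_time_s(payload_bytes):
    """Per-message time on the 155 Mbits/second ATM."""
    return message_time_s(payload_bytes, HEADER_BYTES, ATM_OVERHEAD_S, 155)


for size in (128, 512, 1024, 8192):
    eth_us = ethernet_time_s(size) * 1e6
    atm_us = atm_time_s(size) * 1e6
    winner = "Ethernet" if eth_us < atm_us else "ATM"
    print(f"{size:5d} B  Ethernet {eth_us:7.1f} us  ATM {atm_us:7.1f} us  -> {winner}")
```

With these illustrative overheads the slower link wins for small payloads, because the fixed overhead dominates, and the faster link wins only once messages are large enough for transmission time to matter.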

Pitfall: Ignoring software overhead when determining performance.

Low software overhead requires cooperation with the operating system as well as with the communication libraries. Figure 8.50 gives one example. Another example comes from supercomputers. The CM-5 supercomputer had a software overhead of 20 microseconds to send a message and a hardware overhead of 0.5 microseconds. The Intel Paragon reduced the hardware overhead to just 0.2 microseconds, but the initial release of software had a software overhead of 250 microseconds. Later releases reduced this overhead to 25 microseconds, which still dominates the hardware overhead. This pitfall is simply Amdahl's Law applied to networks: faster network hardware is superfluous if there is not a corresponding decrease in software overhead.

Pitfall: Trying to provide features only within the network vs. end-to-end.

The concern is providing features at a lower level that only partially satisfy a communication demand that can only be accomplished at the highest level. Saltzer, Reed, and Clark [1984] give the end-to-end argument as

"The function in question can completely and correctly be specified only with the knowledge and help of the application standing at the endpoints of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible." [page 278]

Their example of the pitfall was a network at MIT that used several gateways, each of which added a checksum from one gateway to the next. The programmers of the application assumed the checksum guaranteed accuracy, incorrectly believing that the message was protected while stored in the memory of each gateway. One gateway developed a transient failure that swapped one pair of bytes per million bytes transferred. Over time the source code of one operating system was repeatedly passed through the gateway, thereby corrupting the code. The only solution was to correct the infected source files by comparing to paper listings and repairing the code by hand!
Had the checksums been calculated and checked by the application running on the end systems, safety would have been assured.

There is a useful role for intermediate checks, however, provided that end-to-end checking is available. End-to-end checking may show that something is broken between two nodes, but it doesn't point to where the problem is. Intermediate checks can discover the broken component.

A second issue regards performance using intermediate checks. Although it is sufficient to retransmit the whole message from the end point in case of failures, it can be much faster to retransmit a portion of the message at an intermediate point rather than wait for a time-out and a full message retransmit at the end point. Balakrishnan et al. [1997] found that, for wireless networks, such an intermediate retransmission for TCP/IP communication results in 10-30% higher throughput.
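The gateway failure described above can be imitated in a few lines. In this sketch each hop computes and verifies its own checksum, a simulated gateway swaps a pair of bytes while the message sits in its memory, and only a checksum computed by the sending application catches the damage. CRC-32 from Python's standard `zlib` module stands in for whatever checksum the real network used, and the function names and sample message are hypothetical.

```python
import zlib


def link_transfer(payload):
    """One hop: the sender attaches a checksum and the receiver verifies it."""
    checksum = zlib.crc32(payload)           # computed as the message leaves
    received = payload                       # the link itself is error free here
    assert zlib.crc32(received) == checksum  # hop-by-hop check passes
    return received


def faulty_gateway_memory(payload):
    """Swap one pair of bytes while the message is stored between two hops."""
    data = bytearray(payload)
    data[0], data[1] = data[1], data[0]
    return bytes(data)


source = b"int main(void) { return 0; }"
end_to_end_checksum = zlib.crc32(source)  # computed by the sending application

hop1 = link_transfer(source)              # check passes
stored = faulty_gateway_memory(hop1)      # corrupted between the two checks
hop2 = link_transfer(stored)              # check passes on the corrupted data

end_to_end_ok = zlib.crc32(hop2) == end_to_end_checksum
print("hop-by-hop checks passed, end-to-end check passed:", end_to_end_ok)
```

Both hop-level checks succeed because each checksum is computed after the damage is done, which is exactly the gap the end-to-end argument points to.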

Pitfall: Relying on TCP/IP for all networks, regardless of latency, bandwidth, or software requirements.

The network designers on the first workstations decided it would be elegant to use a single protocol stack no matter where the destination of the message: across a room or across an ocean, the TCP/IP overhead must be paid. This might have been a wise decision, especially given the unreliability of early Ethernet hardware, but it sets a high software overhead barrier for commercial systems. Such an obstacle lowers the enthusiasm for low-latency network interface hardware and low-latency interconnection networks if the software is just going to waste hundreds of microseconds when the message must travel only dozens of meters. It also can use significant processor resources. One rough rule of thumb is that each Mbit/second of TCP/IP bandwidth needs about 1 MHz of processor speed, and so a 1000 Mbits/second link could saturate a processor with an 800 to 1000 MHz clock.

The flip side is that from a software perspective, TCP/IP is the most desirable target since it is the most connected and hence offers the largest number of opportunities. The downside of using software optimized to a particular LAN or SAN is that it is limited. For example, communication from a Java program depends on TCP/IP, so optimization for another protocol would require creation of glue software to interface Java to it.

TCP/IP advocates point out that the protocol itself is theoretically not as burdensome as the current implementations, but progress has been modest in commercial systems. There are also TCP/IP off-loading engines entering the market, with the hope of preserving the universal software model while reducing processor utilization and message latency. If processors continue to improve much faster than network speeds, or if multiple processors become ubiquitous, software TCP/IP may become less significant for processor utilization and message latency.

8.15 Concluding Remarks

Networking is one of the most exciting fields in computer science and engineering today.
The purpose of this chapter is to lower the cost of entry into this field by providing definitions and the basic issues so that readers can more easily go into more depth.

The Internet and World Wide Web pervade our society and will likely revolutionize how we access information. Although we couldn't have the Internet without the telecommunication media, it is protocol suites such as TCP/IP that make electronic communication practical. More than most areas of computer science and engineering, these protocols embrace failures as the norm; the network must operate reliably in the presence of failures. Interconnection network hardware and software blend telecommunications with data communications, calling into question whether they should remain as separate academic disciplines or be combined into a single field.

The silicon revolution has made its way to the switch: just as the killer micro changed computing, whatever turns out to be the killer network will transform communication. We are seeing the same dramatic change in cost/performance in switches as the mainframe-minicomputer-microprocessor change did to processors. In 2001, companies that make switches are acquiring companies that make embedded microprocessors, just to have better microprocessors for their switches.

Inexpensive switches mean that network bandwidth can scale with the number of nodes, even to the level of the traditional I/O bus. Both I/O designers and memory system designers must consider how to best select and deploy switches. Thus, networking issues apply to all levels of computer systems today: communication within chips, between chips on a board, between boards, and between computers in a machine room, on a campus, or in a country.

The availability and scalability of networks are transforming the machine room. Disks are being connected over SANs to servers versus being directly attached, and clusters of smaller computers connected by a LAN are replacing large servers. The cost-performance, scalability, and fault isolation of clusters have made them attractive to diverse communities: database, scientific computing, and Internet service providers. It's hard to think what else these communities have in common. The challenge for clusters today is the cost of administration.

After decades of low network performance on shared media, networking is in catch-up mode, and should improve faster than microprocessors. We are not near any performance plateaus, so we expect rapid advances in SANs, LANs, and WANs.

This greater network performance is key to the information- and communication-centric vision of the future of our field. The dramatic improvement in cost/performance of communications has enabled millions of people around the world to find others with common interests.
As the quotes at the beginning of this chapter suggest, the authors believe this revolution in two-way communication will change the form of human associations and actions.

8.16 Historical Perspective and References

This chapter has taken the unusual perspective that computers inside the machine room on a LAN or SAN and computers on an intercontinental WAN share many of the same concerns. Although this observation may be true, their histories are very different. We highlight readings on each topic, but good general texts on networking have been written by Davie, Peterson, and Clark [1999] and by Kurose and Ross [2001].

Wide Area Networks

The earliest of the data interconnection networks are WANs. The forerunner of the Internet is the ARPANET, which in 1969 connected computer science departments across the U.S. that had research grants funded by the Advanced Research Project Agency (ARPA), a U.S. government agency. It was originally envisioned as using reliable communications at lower levels. It was the practical experience with failures of the underlying technology that led to the failure-tolerant TCP/IP, which is the basis for the Internet today. Vint Cerf and Robert Kahn are credited with developing the TCP/IP protocols in the mid 1970s, winning the ACM Software Award in recognition of that achievement. Kahn [1972] is an early reference on the ideas of ARPANET. For those interested in learning more about TCP/IP, Stevens [1994] has written classic books on the topic.

In 1975, there were roughly 100 networks in the ARPANET and only 200 in 1983; in 1995 the Internet encompassed 50,000 networks worldwide, about half of which were in the United States. In 2000, that number is hard to calculate, but the number of IP hosts grew by a factor of 20 in five years. The key networks that made the Internet possible, such as ARPANET and NSFNET, have been replaced by fully commercial systems, and yet the Internet still thrives.

The exciting application of the Internet is the World Wide Web, developed in 1989 for information access by Tim Berners-Lee, a programmer at the European Center for Particle Research (CERN). In 1992, a young programmer at the University of Illinois, Marc Andreessen, developed a graphical interface for the Web called Mosaic. It became immensely popular. He later became a founder of Netscape, which popularized commercial browsers. In May 1995, at the time of the second edition of this book, there were over 30,000 web pages, and the number was doubling every two months. In November 2000, during the writing of the third edition of this book, there were almost 100 million Internet hosts and more than 1.3 billion WWW pages.

Alles [1995] offers a good survey on ATM.
ATM is just the latest of the ongoing standards set by the telecommunications industry, and it is undoubtedly the future for this community. Communication forces standardization by competitive companies, sometimes leading to anomalies. For example, the telecommunication companies in North America wanted to use 64-byte packets to match their existing equipment, while the Europeans wanted 32-byte packets to match their existing equipment. The new standard compromise was 48 bytes to ensure that neither group had an advantage in the marketplace!

Finally, WANs today rely on fiber. Fiber has made such advances that the original assumption of packet switching is no longer true: WAN bandwidth is not precious. Today WAN fibers are often underutilized. Goralski [1997] discusses advances in fiber optics.

Local Area Networks

ARPA's success with wide area networks led directly to the most popular local area networks. Many researchers at Xerox Palo Alto Research Center had been funded by ARPA while working at universities, and so they all knew the value of networking. In 1974, this group invented the Alto, the forerunner of today's desktop computers [Thacker et al. 1982], and the Ethernet [Metcalfe and Boggs 1976], today's LAN. This group--David Boggs, Butler Lampson, Ed McCreight, Bob Sproull, and Chuck Thacker--became luminaries in computer science and engineering, collecting a treasure chest of awards between them.

This first Ethernet provided a 3 Mbits/sec interconnection, which seemed like an unlimited amount of communication bandwidth with computers of that era. It relied on the interconnect technology developed for the cable television industry. Special microcode support gave a round-trip time of 50 microseconds for the Alto over Ethernet, which is still a respectable latency. It was Boggs's experience as a ham radio operator that led to a design that did not need a central arbiter, but instead listened before use and then varied back-off times in case of conflicts.

The announcement by Digital Equipment Corporation, Intel, and Xerox of a standard for 10 Mbits/sec Ethernet was critical to the commercial success of Ethernet. This announcement short-circuited a lengthy IEEE standards effort, which eventually did publish an IEEE standard for Ethernet.

There have been several unsuccessful candidates in trying to replace the Ethernet. The FDDI committee, unfortunately, took a very long time to agree on the standard, and the resulting interfaces were expensive. It was also a shared medium at a time when switches were becoming affordable. ATM also missed the opportunity, due in part to the long time to standardize the LAN version of ATM. Due to failures of the past, LAN modernization efforts have been centered on extending Ethernet to lower cost media, to switched interconnect, to higher link speeds, and to new domains such as wireless communication.
Spurgeon [2001] has a nice on-line summary of Ethernet technology, including some of its history.

Massively Parallel Processors

One of the places of innovation in interconnect networks was in massively parallel processors (MPPs). An early MPP was the Cosmic Cube [Seitz 1985], which used Ethernet interface chips to connect 8086 computers in a hypercube. SAN interconnections have improved considerably since then, with messages routed automatically through intermediate switches to their final destinations at high bandwidths and with low latency. Considerable research has gone into the benefits of different topologies in both construction and program behavior. Whether due to faddishness or changes in technology is hard to say, but topologies certainly become very popular and then disappear. The hypercube, widely popular in the 1980s, almost disappeared from MPPs of the 1990s. Cut-through routing, however, has been preserved and is covered by Dally and Seitz [1986].

Chapter 6 records the poor current state of such machines. Government programs such as the Accelerated Strategic Computing Initiative (ASCI) still result in a handful of one-of-a-kind MPPs costing $50 to $100 million, yet these are basically clusters of SMPs.

Clusters

Clusters were probably invented in the 1960s by customers who could not fit all their work in one computer, or who needed a backup machine in case of failure of the primary machine [Pfister 1998]. Tandem introduced a 16-node cluster, and Digital followed with VAXclusters. They were originally independent computers that shared I/O devices, requiring a distributed operating system to coordinate activity. Soon they had communication links between computers, in part so that the computers could be geographically distributed to increase availability in case of a disaster at a single site. Users log onto the cluster and are unaware of which machine they are running on. DEC (now Compaq) sold more than 25,000 clusters. Other early companies were Tandem (now Compaq) and IBM (still IBM), and today virtually every company has cluster products. Most of these products are aimed at availability, with performance scaling as a secondary benefit. Yet in 2000 clusters generally dominate the list of top performers of the TPC-C database benchmark.

Scientific computing on clusters emerged as a competitor to MPPs. In 1993, the Beowulf Project started with the goal of fulfilling NASA's desire for a 1 GFLOPS computer for under $50,000. In 1994, a 16-node cluster built from off-the-shelf PCs using 80486s achieved that goal [Bell 2001]. This emphasis led to a variety of software interfaces to make it easier to submit, coordinate, and debug large programs or a large number of independent programs. In 2001, the fastest (and largest) supercomputers are typically clusters, at least by some popular measures.

Efforts were made to reduce latency of communication in clusters as well as to increase bandwidth, and several research projects worked on that problem.
(One commercial result of the low latency research was the VI interface standard, which has been embraced by Infiniband, discussed below.) Low latency then proved useful in other applications. For example, in 1997 a cluster of 100 UltraSPARC desktop computers at UC Berkeley, connected by 160-MB/sec per link Myrinet switches, was used to set world records in database sort, sorting 8.6 GB of data originally on disk in one minute, and in cracking an encrypted message, taking just 3.5 hours to decipher a 40-bit DES key. This research project, called Network of Workstations [Anderson et al. 1995], also developed the Inktomi search engine, which led to a startup company with the same name.

For those interested in learning more, Pfister [1998] has written an entertaining book on clusters. In even greater detail, Sterling [2001] has written a do-it-yourself book on how to build a Beowulf cluster.

System or Storage Area Networks (SANs)

At the second edition of this book, a new class of networks was emerging: system area networks. These networks are designed for a single room or single floor, so their length is tens to hundreds of meters, and they were intended for use in clusters. Close distance means the wires can be wider and faster at lower cost, network hardware can ensure in-order delivery, and cascading switches consume less handshaking time. There is also less reason to go to the cost of optical fiber, since the distance advantage of fiber is less important for SANs. The limited size of the networks also makes source-based routing plausible, further simplifying the network. Both Tandem Computers and Myricom sold SANs.

In the intervening years the acronym SAN has been co-opted to also mean storage area networks, whereby networking technology is used to connect storage devices to compute servers. Today most people mean storage when they say SAN. The most widely used example in 2001 is Fibre Channel Arbitrated Loop (FC-AL). Not only are disk arrays attached to servers via FC-AL links, there are even some disks with FC-AL links. There are also companies selling FC-AL switches so that storage area networks can enjoy the benefits of greater bandwidth and interconnectivity of switching.

In October 2000 the Infiniband Trade Association announced the version 1.0 specification of Infiniband. Led by Intel, HP, IBM, Sun, and other companies, it was proposed as a successor to the PCI bus that brings point-to-point links and switches with its own set of protocols. Its characteristics are potentially desirable both for system area networks to connect clusters and for storage area networks to connect disk arrays to servers. To learn more, the Infiniband standard [2001] is available on the WWW.

The chief competition for Infiniband is the rapidly improving Ethernet technology. The Internet Engineering Task Force is proposing a standard called iSCSI to send SCSI commands over IP networks (Satran [2001]).
Given the likely cost advantages of the higher volume Ethernet switches and interface cards, in 2001 it is unclear who will win. Will Infiniband take over the machine room, leaving the WAN as the only link that is not Infiniband? Or will Ethernet dominate the machine room, even taking over some of the role of storage area networks, leaving Infiniband to simply be an I/O bus replacement? Or will there be a three-level solution: Infiniband in the machine room, Ethernet in the building and on the campus, and the WAN for the country? Will TCP/IP off-loading engines become available that can reduce processor utilization and provide low latency yet still provide the software interfaces and generality of TCP/IP? Or will software TCP/IP and faster multiprocessors be sufficient?

In 2001, it is very hard to tell which will win. A wonderful characteristic of computer architecture is that such issues will not remain endless academic debates, unresolved as people rehash the same arguments repeatedly. Instead, the battle is fought in the marketplace, with well-funded and talented groups giving

their best efforts at shaping the future. Moreover, constant changes to technology reward those who are either astute or lucky. The best combination of technology and follow-through has often determined commercial success. Let the games begin! Time will tell us who wins and who loses, and we will likely know the score by the next edition of this text.

References

ALLES, A. [1995]. ATM Internetworking (May).
ANDERSON, T. E., D. E. CULLER, AND D. PATTERSON [1995]. A case for NOW (networks of workstations), IEEE Micro 15:1 (February).
ARPACI, R. H., D. E. CULLER, A. KRISHNAMURTHY, S. G. STEINBERG, AND K. YELICK [1995]. Empirical evaluation of the CRAY-T3D: A compiler perspective, Proc. 23rd Int'l Symposium on Computer Architecture (June), Italy.
BALAKRISHNAN, H., V. N. PADMANABHAN, S. SESHAN, AND R. H. KATZ [1997]. A comparison of mechanisms for improving TCP performance over wireless links, IEEE/ACM Transactions on Networking 5:6 (December).
BRAIN, M. [2000]. Inside a Digital Cell Phone.
BREWER, E. A. AND B. C. KUSZMAUL [1994]. How to get good performance from the CM-5 data network, Proc. Eighth Int'l Parallel Processing Symposium (April), Cancun, Mexico.
BRIN, S. AND L. PAGE [1998]. The anatomy of a large-scale hypertextual Web search engine, Proc. 7th International World Wide Web Conference, Brisbane, Qld., Australia (14-18 April).
COMER, D. [1993]. Internetworking with TCP/IP, 2d ed., Prentice Hall, Englewood Cliffs, N.J.
DALLY, W. J. AND C. L. SEITZ [1986]. The torus routing chip, Distributed Computing 1:4.
DAVIE, B. S., L. L. PETERSON, AND D. CLARK [1999]. Computer Networks: A Systems Approach, second edition, Morgan Kaufmann Publishers, San Francisco.
DESURVIRE, E. [1992]. Lightwave communications: The fifth generation, Scientific American (International Edition) 266:1 (January).
GRICE, C. AND M. KANELLOS [2000]. Cell phone industry at crossroads: Go high or low?, CNET News, August 31.
GORALSKI, W. [1997]. SONET: A Guide to Synchronous Optical Network, McGraw-Hill, New York.
GROE, J. B. AND L. E. LARSON [2000]. CDMA Mobile Radio Design, Artech House, Boston.
INFINIBAND TRADE ASSOCIATION [2001]. InfiniBand Architecture Specifications Release 1.0.a.
KAHN, R. E. [1972]. Resource-sharing computer communication networks, Proc. IEEE 60:11 (November).
KUROSE, J. F. AND K. W. ROSS [2001]. Computer Networking: A Top-Down Approach Featuring the Internet, Addison-Wesley, Boston.
METCALFE, R. M. [1993]. Computer/network interface design: Lessons from Arpanet and Ethernet, IEEE J. on Selected Areas in Communications 11:2 (February).
METCALFE, R. M. AND D. R. BOGGS [1976]. Ethernet: Distributed packet switching for local computer networks, Comm. ACM 19:7 (July).
NATIONAL RESEARCH COUNCIL [1997]. The Evolution of Untethered Communications, Computer Science and Telecommunications Board, National Academy Press, Washington, D.C.

PARTRIDGE, C. [1994]. Gigabit Networking, Addison-Wesley, Reading, Mass.
PFISTER, G. F. [1998]. In Search of Clusters, 2d ed., Prentice Hall PTR, Upper Saddle River, N.J.
SALTZER, J. H., D. P. REED, AND D. D. CLARK [1984]. End-to-end arguments in system design, ACM Trans. on Computer Systems 2:4 (November).
SEITZ, C. L. [1985]. The Cosmic Cube (concurrent computing), Communications of the ACM 28:1 (January).
STERLING, T. [2001]. Beowulf PC Cluster Computing with Windows and Beowulf PC Cluster Computing with Linux, MIT Press, Cambridge, MA.
SPURGEON, C. [2001]. Charles Spurgeon's Ethernet Web Site, wwwhost.ots.utexas.edu/ethernet/ethernet-home.html.
SATRAN, J. ET AL. [2001]. "iSCSI," IPS working group of IETF, Internet draft.
STEVENS, W. R. [ ]. TCP/IP Illustrated (three volumes), Addison-Wesley, Reading, Mass.
TANENBAUM, A. S. [1988]. Computer Networks, 2d ed., Prentice Hall, Englewood Cliffs, N.J.
THACKER, C. P., E. M. MCCREIGHT, B. W. LAMPSON, R. F. SPROULL, AND D. R. BOGGS [1982]. Alto: A personal computer, in Computer Structures: Principles and Examples, D. P. Siewiorek, C. G. Bell, and A. Newell, eds., McGraw-Hill, New York.
WALRAND, J. [1991]. Communication Networks: A First Course, Aksen Associates: Irwin, Homewood, Ill.

EXERCISES

Using the examples from section 8.11, use the techniques from Chapter 7 to calculate the reliability of the cluster. The results of failures on Tertiary Disk give one set of failure information. What is the MTTF? Where are the single points of failure? How could the designs be changed to improve MTTF?

Along similar lines, calculate the performance bottlenecks. How does it change if we use rules of thumb on utilization from Chapter 7 vs. assuming 100% utilization?

The SAN versions just use FC-AL loops versus adding an FC-AL switch. What would have to change in the disk system to make an FC-AL switch valuable? (RAID is the bottleneck with only a single FC-AL loop between the box and the server.)

Undoubtedly the top 10 of TPC-C has changed.
Find a cluster from Dell or Compaq, and go to their web sites to determine the prices of the varying cluster strategies as we did in the examples. Note that the executive overview lists all the components and their prices at the time of the benchmark. They can serve as good placeholders until or unless you can find the current real prices online. They also supply maintenance costs.

Add a discussion question on use of Ethernet vs. Infiniband in the machine room. What are the technical advantages of each? What are the economic advantages of each? Why would people maintaining the system prefer one to the other?

In all exercises, should go to faster Ethernet (and ATM).

We could use 5 to 10 more exercises. Some simple ones: go to the TPC web site and look at which architectures--clusters vs. some form of multiprocessors--dominate each benchmark in performance and in cost performance. Make a discussion question as to why this might vary between benchmarks. How has it changed since the data in the figure? Have the trends continued, or not?

Do a similar study for the Linpack benchmarks (List of Top 500 supercomputers). See if there are older versions of the list so you can see how machine types and brand names change over time. How has it changed since the data in the figure? Have the trends continued, or not?

If you have access to an SMP and a cluster, write a program to measure latency of communication and bandwidth of communication between processors.

8.1 [15] <8.2> Assume the overhead to send a zero-length data packet on an Ethernet is 500 microseconds and that an unloaded network can transmit at 90% of the peak 10 Mbits/sec rating. Plot the delivered bandwidth as the data transfer size varies upward from 32 bytes.

Change Ethernet speed in the next one. Figure still there?

8.2 [15] <8.2> One reason that ATM has a fixed transfer size is that when a short message is behind a long message, a node may need to wait for an entire transfer to complete. For applications that are time-sensitive, such as when transmitting voice or video, the large transfer size may result in transmission delays that are too long for the application. On an unloaded interconnection, what is the worst-case delay if a node must wait for one full-size Ethernet packet versus an ATM transfer? See Figure 8.20 (page 605) to find the packet sizes. For this question assume you can transmit at 100% of the 155 Mbits/sec of the ATM network and 100% of the 10 Mbits/sec Ethernet.

Update next one to larger tapes, speeds. Match assumptions in revised example?
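The arithmetic behind Exercises 8.1 and 8.2 can be set up with a short sketch. It assumes the simplest model, total time = fixed overhead + bytes / link rate, which is an assumption here and not something the exercise text prescribes. The function names, the sweep of transfer sizes, and the 1500-byte Ethernet packet are illustrative placeholders; the exercises point to Figure 8.20 for the actual packet sizes. The 53-byte ATM cell is the 48-byte payload mentioned earlier plus a 5-byte header.

```python
# Sketch for Exercises 8.1 and 8.2 under an assumed simple model:
# total time = fixed overhead + bits / usable link rate.

def delivered_mbits_per_sec(size_bytes, overhead_s=500e-6,
                            peak_mbits=10.0, efficiency=0.90):
    """Delivered bandwidth (Mbits/sec) for one transfer of size_bytes."""
    bits = size_bytes * 8
    wire_time_s = bits / (peak_mbits * 1e6 * efficiency)
    return bits / (overhead_s + wire_time_s) / 1e6


def blocking_delay_us(size_bytes, link_mbits):
    """Microseconds a node waits behind one transfer of size_bytes."""
    # bits divided by Mbits/sec gives microseconds directly
    return size_bytes * 8 / link_mbits


# Hypothetical sweep of transfer sizes; pick the range the exercise asks for.
for size in (32, 64, 128, 256, 512, 1024):
    print(size, round(delivered_mbits_per_sec(size), 3))

# Placeholder packet sizes (assumptions; see Figure 8.20 for the real ones).
print(blocking_delay_us(1500, 10.0))   # one large Ethernet packet at 10 Mbits/sec
print(blocking_delay_us(53, 155.0))    # one ATM cell at 155 Mbits/sec
```

With a 500-microsecond overhead the small transfers are dominated by overhead, so the delivered bandwidth only approaches the 9 Mbits/sec usable rate for very large transfers.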
8.3 [20/10] <8.3> Is electronic communication always fastest for longer distances than the Example on page 583? Calculate the time to send 100 GB using 10 8-mm tapes and an overnight delivery service versus sending 100 GB by FTP over the Internet. Make the following four assumptions:

The tapes are picked up at 4 P.M. Pacific time and delivered 4200 km away at 10 A.M. Eastern time (7 A.M. Pacific time).
On one route the slowest link is a T1 line, which transfers at 1.5 Mbits/sec.
On another route the slowest link is a 10 Mbits/sec Ethernet.

You can use 50% of the slowest link between the two sites.

a. [20] <8.3> Will all the bytes sent by either Internet route arrive before the overnight delivery person arrives?
b. [10] <8.3> What is the bandwidth of overnight delivery? Calculate the average bandwidth of overnight delivery service for a 100-GB package.

Perhaps a next exercise can add bandwidth of networking links on campus and over the internet. Mary Baker at Stanford has created a new set of software that is much more efficient at finding bandwidth. Latency can be figured out from ping and traceroute (I recall). I can imagine several exercises along these lines.

8.4 [20/20/20/20] <8.8> If you have access to a UNIX system, use ping to explore the Internet. First read the manual page. Then use ping without option flags to be sure you can reach the following sites. It should say that X is alive. Depending on your system, you may be able to see the path by setting the flags to verbose mode (-v) and trace route mode (-R) to see the path between your machine and the example machine. Alternatively, you may need to use the program traceroute to see the path. If so, try its manual page. You may want to use the UNIX command script to make a record of your session.

a. [20] <8.8> Trace the route to another machine on the same local area network.
b. [20] <8.8> Trace the route to another machine on your campus that is not on the same local area network.
c. [20] <8.8> Trace the route to another machine off campus. For example, if you have a friend you send email to, try tracing that route. See if you can discover what types of networks are used along that route.
d. [20] <8.8> One of the more interesting sites is the McMurdo NASA government station in Antarctica. Trace the route to mcmvax.mcmurdo.gov.

Change next to Ethernet example?

8.5 [12/15/15] <8.4> Assume 64 nodes and ATM switches in the following. (This exercise was suggested by Mark Hill.)

a. [12] <8.4> Design a switch topology that has the minimum number of switches.
b. [15] <8.4> Design a switch topology that has the minimum latency through the switches. Assume unit delay in the switches and zero delay for wires.
c. [15] <8.4> Design a switch topology that balances the bandwidth required for all links. Assume a uniform traffic pattern.

I think this example was dropped?

8.6 [20] <8.4> Redo the cut-through routing calculation for CM-5 on page 590 of different sizes: 64, 256, and 1024 nodes.

I dropped the all-to-all example. Perhaps put in as exercise? Reword and see if this exercise still makes sense.

8.7 [15] <8.4> Calculate the time to perform a broadcast (from-one-to-all) on each of the topologies in Figure 8.17 on page 598, making the same assumptions as the two Examples.

I dropped the all-to-all example. Reword and see if this exercise still makes sense.

8.8 [20] <8.4> The two Examples assumed unlimited bandwidth between the node and the network interface. Redo the calculations in Figure 8.17 on page 598, this time assuming a node can only issue one message in a time unit.

8.9 [15] <8.4> Compare the interconnection latency of a crossbar, Omega network, and fat tree with eight nodes. Use Figure 8.13 on page 593 and add a fat tree similar to Figure 8.14 on page 595 as a third option. Assume that each switch costs a unit time delay. Assume the fat tree randomly picks a path, so give the best case and worst case for each example. How long will it take to send a message from node 0 to 6? How long will it take 1 and 7 to also communicate?

Figure in next exercise was dropped. Replacing it with 8.50 on page 651 requires changing the question. Perhaps use the data in the figure to first calculate what is the delivered Mbits/sec (accounting for overheads) for each network for each size of NFS payload. Then ask what is n1/2 for each network given those overheads. <<Figure below is gone, so pick another example?>>

8.10 [15] <8.4> One interesting measure of the latency and bandwidth of an interconnection is to calculate the size of a message needed to achieve one-half of the peak bandwidth. This halfway point is sometimes referred to as n1/2, taken from vector processing. Using Figure 7.36 on page 621, estimate n1/2 for a TCP/IP message using ATM and the Ethernet.

8.11 [15] <8.8> Use FTP to transfer a file from a remote site and then between local sites on the same LAN. What is the difference in bandwidth for each transfer? Try the transfer at different times of day or days of the week. Is the WAN or LAN the bottleneck?

8.12 [15] <8.4> Draw the topology of a 6-cube similar to the drawing of the 4-cube in Figure 8.16.

8.13 [12/12/12/15/15/18] <8.7> Use the M/M/1 queuing model to answer this exercise.
Measurements of a network bridge show that packets arrive at 200 packets per second and that the gateway forwards them in about 2 ms.

a. [12] <8.7> What is the utilization of the gateway?
b. [12] <8.7> What is the mean number of packets in the gateway?
c. [12] <8.7> What is the mean time spent in the gateway?
d. [15] <8.7> Plot the response time versus utilization as you vary the arrival rate.
e. [15] <8.7> For an M/M/1 queue, the probability of finding n or more tasks in the system is Utilization^n. What is the chance of an overflow of the FIFO if it can hold 10 messages?
f. [18] <8.7> How big must the gateway be to have packet loss due to FIFO overflow to be less than one packet per million?

8.14 [20] <8.7> The imbalance between the time of sending and receiving can cause problems in network performance. Sending too fast can cause the network to back up and increase the latency of messages, since the receivers will not be able to pull out the message fast enough. A technique called bandwidth matching proposes a simple solution: Slow down the sender so that it matches the performance of the receiver [Brewer 1994]. If two machines exchange an equal number of messages using a protocol like UDP, one will get ahead of the other, causing it to send all its messages first. After the receiver puts all these messages away, it will then send its messages. Estimate the performance for this case versus a bandwidth-matched case. Assume the send overhead is 200 microseconds, the receive overhead is 300 microseconds, time of flight is 5 microseconds, and latency is 10 microseconds, and that the two machines want to exchange 100 messages.

8.15 [40] <8.7> Compare the performance of UDP with and without bandwidth matching by slowing down the UDP send code to match the receive code as advised by bandwidth matching [Brewer 1994]. Devise an experiment to see how much performance changes as a result. How should you change the send rate when two nodes send to the same destination? What if one sender sends to two destinations?
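For the M/M/1 gateway exercise above (200 packets per second, about 2 ms per packet), the standard single-server queuing results can be collected in a small sketch. The function names are hypothetical, and treating "n or more tasks in the system" as the overflow condition for a FIFO of a given size is an interpretation that the exercise leaves to the reader.

```python
# Sketch of standard M/M/1 results; function names are illustrative only.

def mm1_stats(arrival_rate, service_time):
    """Return (utilization, mean number in system, mean time in system)."""
    rho = arrival_rate * service_time            # server utilization
    if not 0 <= rho < 1:
        raise ValueError("M/M/1 is stable only when utilization < 1")
    mean_in_system = rho / (1 - rho)             # waiting plus in service
    mean_time = service_time / (1 - rho)         # consistent with Little's law
    return rho, mean_in_system, mean_time


def prob_n_or_more(rho, n):
    """Probability of finding n or more tasks in the system."""
    return rho ** n


def smallest_n_below(rho, target):
    """Smallest n for which prob_n_or_more(rho, n) < target (0 <= rho < 1)."""
    n = 0
    while prob_n_or_more(rho, n) >= target:
        n += 1
    return n


rho, in_system, time_in_system = mm1_stats(200, 0.002)
print(rho, in_system, time_in_system)
print(prob_n_or_more(rho, 10))
print(smallest_n_below(rho, 1e-6))
```

Sweeping arrival_rate while holding service_time fixed gives the response-time-versus-utilization curve asked for in part d; the curve rises sharply as utilization approaches 1.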
