Paper"Goals" A"Scalable,"Commodity"Data" Center"Network"Architecture" Al9Fares,"Loukissas,"Vahdat,""A"Scalable,"Commodity" Data"Center"Network"Architecture,""Proc."of"ACM" SIGCOMM"'08,"38(4):63974,"Oct."2008." " Presenter:"William"Beyer" Point"out"faults"with"current"data"center" designs" Propose"new"architecture"based"on"fat9tree" Scalable"interconnecUon"bandwidth" Economies"of"scale" Backward"compaUbility" A"Typical"Data"Center" Data"center"topology"is"typically"293"level"tree" of"switches"and"routers" OversubscripUon" RaUo"of"worst9case"achievable"aggregate" bandwidth"among"end9hosts"to"the"total" bisecuon"bandwidth"of"the"network"topology" Ability"of"hosts"to"fully"uUlize"their"uplink"capacity" 1:1" "All"hosts"can"use"full"uplink"capacity" 5:1" "Only"20%"of"host"bandwidth"may"be" available" Typical"raUo"is"2.5:1"(400"Mbps)"to"8:1"(125" Mbps)"
MulU9path"RouUng" Cost"Analysis" MulU9rooted "tree"required"to"communicate" at"full"bandwidth"for"large"clusters" Otherwise"limited"to"max"bandwidth"of"a"single" expensive"switch"(1289port"10"gige)" Use"mulU9path"rouUng"technique"such"as" ECMP" Performs"staUc"load"splidng,"cannot"account"for" flow"sizes" RouUng"tables"become"very"large"with"mulUple" paths" Cost"Analysis" Fat9tree"Architecture" k9ary"fat9tree:"three9layer"topology"(edge," aggregauon,"core)" k"pods,"each"consists"of"(k/2) 2 "hosts"and"two"layers" (edge/aggregate)"each"with"k/2"k9port"switches" Each"edge"switch"connects"to"k/2"hosts"and"k/2" aggregate"switches" Each"aggregate"switch"connects"to"k/2"edge"and"k/2" core"switches" (k/2) 2 "core"switches:"each"connects"to"k"pods" Supports"k 3 /4"hosts!"
Fat9tree"Topology"with"k"="4" Issues"with"Fat9tree"Topologies" Backwards"compaUble"with"IP/Ethernet" Good"thing,"but"rouUng"algorithms"will"naively" choose"a"single"shortest"path"to"use"between" subnets" Leads"to"boilenecks"quickly" (k/2) 2 "shortest"paths"available,"should"use"them" all"equally" Complex"wiring"due"to"lack"of"high"speed" ports" Addressing"in"Fat9tree" Use"10.0.0.0/8"private"addressing"block" Pod"switches"have"address"10.pod.switch.1" Pod"and"switch"in"[0,"k91]"based"on"posiUon" Core"switches"have"address"10.k.j.i" i"and"j"denote"core"posiuon"in"(k/2) 2 "core"switches" Hosts"have"address"10.pod.switch.ID" ID"is"host"ID"in"switch"subnet"([2,"(k/2)"+"1])" k"<"256,"this"scheme"does"not"scale"indefinitely" Two9Level"Lookup"Table" Prefixes"used"for"forwarding"intra9pod"traffic" Suffixes"used"for"forwarding"inter9pod"traffic"
Two9Level"Lookup"ImplementaUon" Implemented"in"hardware"using"a"TCAM" Can"perform"parallel"lookups"across"table" Stores"don t"care"bits,"suitable"for"storing"variable" length"prefixes" Prefixes"preferred"over"suffixes" RouUng"Algorithm" Prefixes"in"two9level"table"prevent"intra9pod" traffic"from"leaving"pod" Inter9pod"traffic"handled"by"suffix"table" Suffixes"based"off"host"IDs,"ensures"spread"of" traffic"across"core"switches" Prevents"packet"reordering"by"having"staUc"path" Each"host9to9host"communicaUon"has"a"single" stauc"path" Beier"than"having"a"single"path"between"subnets" RouUng"Algorithm"(cont.)" RouUng"Algorithm"Example" Core"switches"contain"(10.pod.0.0/16,"port)" entries" StaUcally"forwards"inter9pod"traffic"on"specified"port" Aggregate"switches"contain"(10.pod.switch.0/24," port)"entries" Switch"value"is"the"edge"switch"number" Assumes"a"central"enUty"with"full"knowledge"of" topology"generates"these"rouung"tables" Also"responsible"for"detecUng"switch"failures"and"re9 rouung"traffic"
Dynamic"RouUng"Techniques" AlternaUves"to"two9level"rouUng"table" Aiempt"to"classify"and"schedule"flows"rather"than"use" stauc"rouung" Flow"ClassificaUon" Periodically"reassigns"flow"output"ports" Prevents"compeUUon"between"flows"for"a"single"port" Flow"Scheduling" IdenUfy"large"flows"and"establish"reserved"paths"for" them" Requires"communicaUon"between"edge"switches"and" a"central"flow"scheduler" Fault"Tolerance" Many"possible"paths"between"hosts"leads"to" easy "fault"tolerance" Each"switch"maintains"BidirecUonal" Forwarding"DetecUon"session"with"neighbors" Allows"switch"to"determine"when"neighbors"fail" Two"primary"types"of"link"failure" Between"lower"and"upper"switches" Between"upper"and"core"switches" Router"Power"and"Heat"DissipaUon" Topology"Power/Heat"DissipaUon"
Cafarella,"2013" Cafarella,"2013" Hamilton,"2008" Emerson"Network"Power," 2007" Cafarella,"2013" EPA,"2007" Sotware"ImplementaUon" Validated"in"sotware"using"Click" Click"is"a"modular"sotware"router"architecture" Implement"routers"on"PCs,"supports"experimental" router"designs" Click"modules"called" elements " Each"element"performs"a"specified"task" RouUng"table"lookup,"decrement"packet"TTL,"etc " Implemented"elements"for"two9level"table," flow"classifier,"and"flow"scheduler"
EvaluaUon"Setup" Uses"a"49port"fat9tree"as"seen"previously" Two9level"table"and"flow9based"schemes"analyzed" Compared"against"hierarchical"tree"with" oversubscripuon"rauo"of"3.6:1" Both"evaluated"using"Click" Emulate"switches"and"hosts"on"PCs" All"hosts"generate"96"Mbit/s"of"outgoing" traffic" This"value"prevents"CPU"from"throiling"test" EvaluaUon"Results" Percentages"indicate"aggregate"network" bandwidth" Measured"as"amount"of"incoming"traffic"received"by" hosts"" Flow"Scheduler"Requirements" Minimal"Ume"and"memory"requirements"for" flow"scheduler" Feasible"to"use"at"least"unUl"k"grows" extremely"large" Packaging"Problem" Fat9tree"has"significant"cabling"overhead" 1"GigE"switches"used"to"reduce"cost" Lack"of"10"GigE"ports"leads"to"more"cabling" Present"a"packaging"soluUon"for"k=48" Generalizes"to"other"values"of"k"
Packaging"SoluUon" Strengths" Fat9tree"architecture"seems"to"outperform" hierarchical"soluuon" Excellent"power"and"heat"reducUons"over" hierarchical"approach" EvaluaUon"methods"were"good"overall"with" tests"performed" Data"centers"can"easily"switch"to"this"new" method" Weaknesses" Language"used"in"paper"was"confusing"at" Umes" Referred"to"pod"switches"as" aggregate"switch," upper9layer"switch,"and" upper"pod"switch "at" various"points" EvaluaUon"performed"with"small"value"of"k=4" Would"have"been"nice"to"see"higher"values"of"k" tested" Academic"project"and"resources"were"obviously"a" factor"for"evaluauon" References" Al9Fares,"Loukissas,"Vahdat,"" A"Scalable,"Commodity"Data"Center"Network"Architecture,""Proc.&of&ACM& SIGCOMM&'08,"38(4):63974,"Oct."2008." Cafarella,"M."(2013,"April"20)."Datacenters."EECS&485."Lecture"conducted"from" University"of"Michigan,"Ann"Arbor." "Energy"Efficient"Cooling"SoluUons"for"Data"Centers.""Emerson&Network&Power." 2007"Web."28"Oct."2013."<hip://www.emersonnetworkpower.com/documents/ en9us/latest9thinking/edc/documents/white%20paper/ energy_efficient_cooling_soluuons_for_data_centers.pdf>." Hamilton,"James.""PerspecUves"9"Cost"of"Power"in"Large9Scale"Data" Centers.""Perspec>ves&@&James&Hamilton's&Blog."N.p.,"28"Nov."2008."Web."28"Oct." 2013."<hip://perspecUves.mvdirona.com/2008/11/28/ CostOfPowerInLargeScaleDataCenters.aspx>." "Report"to"Congress"on"Server"and"Data"Center"Energy"Efficiency.""Energy&Star." U.S."Environmental"ProtecUon"Agency,"2"Aug."2007."Web."28"Oct."2013."<hip:// www.energystar.gov/ia/partners/prod_development/downloads/ EPA_Datacenter_Report_Congress_Final1.pdf?db729bf5a>."