Lecture 1: Processors (Applied Computer Science)

Functional units are the basic building blocks of processors. Core units for a typical processor include:
- Instruction Unit: fetches, decodes and dispatches instructions. May also be responsible for scheduling instructions.
- Integer Unit: handles integer arithmetic and logical operations. AKA Arithmetic Logic Unit (ALU).
- Floating Point Unit: handles floating point multiply, divide and addition.
- Control Unit: responsible for branches and jumps.
- Load/Store Unit: loads data to/from memory.
- Register File: local storage for the CPU. Has separate files for integer and floating point.

Every processor executes an instruction set; there are two main types:
- CISC: complex instructions. Minimise the number in each program. Easier to compile for.
- RISC: low-level, fast instructions. Minimises the cycles per instruction.
The distinction between these is blurred by more complex RISC, and by CISC being implemented in a RISC-like way. A typical RISC ISA contains: control, arithmetic, floating point and data transfer instructions.

Pipelining is a key way of making fast CPUs. Execution is split into steps, each implemented in a clock cycle, and the steps are pipelined on different data. There are three major problems with pipelines:
- Structural hazards: two instructions both require the same hardware at the same time.
- Data hazards: one instruction depends on a result further down the pipeline.
- Control hazards: the result of an instruction changes which instruction is next, e.g. branches.

Data hazards can be minimised by forwarding (the result needed is sent directly to the unit needing it) or by the compiler re-ordering instructions to avoid them. Out-of-order execution is a recent technique that moves the scheduling of instructions into the hardware; it requires complex bookkeeping of instructions, implemented as a scoreboard. It may also rename registers.

Control hazards can be controlled by designing the pipeline so that it can back out of branches. Can also use branch delay slots: execute an instruction in the shadow of the branch. Can avoid control hazards by doing branch prediction dynamically in hardware, using 1- or 2-bit prediction.
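The 2-bit prediction mentioned above can be sketched as a saturating counter (a generic illustration, not the lecture's exact scheme): counter values 0-1 predict not-taken, 2-3 predict taken, and each outcome nudges the counter one step.

```c
#include <assert.h>

/* Minimal 2-bit saturating-counter branch predictor (illustrative sketch). */
typedef struct { unsigned counter; /* 0..3 */ } predictor_t;

/* Predict taken when the counter is in the upper half (2 or 3). */
int predict(const predictor_t *p) { return p->counter >= 2; }

/* Move one step toward the actual outcome, saturating at 0 and 3. */
void update(predictor_t *p, int taken) {
    if (taken && p->counter < 3) p->counter++;
    else if (!taken && p->counter > 0) p->counter--;
}
```

The point of the second bit is hysteresis: a branch that is almost always taken does not flip the prediction on a single not-taken outcome.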
Pipelining is a form of Instruction Level Parallelism (ILP). There are two approaches to ILP:
- Superscalar: parallel instructions identified at run-time, in hardware
- VLIW (Very Long Instruction Word): parallel instructions identified by the compiler
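The run-time check a superscalar processor makes, deciding whether two fetched instructions are independent, can be sketched as follows (the three-register instruction encoding here is an assumption for illustration, not the lecture's):

```c
#include <assert.h>

/* Toy instruction: writes dest, reads src1 and src2. */
typedef struct { int dest, src1, src2; } instr_t;

/* Two instructions can issue together only if no data hazard links them. */
int can_dual_issue(instr_t a, instr_t b) {
    if (b.src1 == a.dest || b.src2 == a.dest) return 0; /* RAW: b reads a's result */
    if (a.src1 == b.dest || a.src2 == b.dest) return 0; /* WAR: b overwrites a's input */
    if (a.dest == b.dest) return 0;                     /* WAW: both write same register */
    return 1;
}
```

A real superscalar also checks structural hazards (instruction class vs. available units); this sketch covers only the data-hazard part.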
Superscalar processors divide instructions into classes which use different resources, i.e. add an integer and a floating point number simultaneously. This is detected in hardware:
- Fetch several instructions
- Decide their class and whether they can be issued in parallel
- Take into account structural and data hazards
The compiler may help in this process.

VLIW: explicitly issue multiple instructions per clock cycle, grouped together in one long instruction. There are restrictions on the combination of instructions inside the word. ILP has limits, due to finding enough independent instructions.

Lecture 2: Vector Processors

Pipelining has limitations: longer pipelines induce more hazards, and the instruction fetch and decode rate becomes a bottleneck. A vector processor limits what can be pipelined to overcome this:
- Computation of each vector element is independent
- One vector operation can be encoded in a single instruction
- Regular memory access
- Fewer branches

In addition to the normal scalar units, vector processors also contain vector registers, vector functional units and vector load/store units. Vector instructions are vector versions of scalar instructions, but they have a start-up cost.

Time to execute a vector loop of length n: T_loop = T_startup + n x T_element. Time per element: T_startup / n + T_element. Long loops approach peak performance. A useful metric is N_1/2: the length of loop needed to reach half peak performance.

Vectorising inner loops requires non-unit strides; the load/store unit can specify this. Chaining allows dependencies, such as results needed for the next calculation, to be handled within the pipeline. Relies on the compiler to detect independence of loops. Some operations can seriously extend vector computing to more generalised cases, not just long loops:
- Conditional execution using a vector mask: still do all the sums, but only output those needed
- Scatter/gather: indirect addressing; load/store can work with an index vector

Lecture 3: Caches

Processors are speeding up much faster than memory. Programs exhibit two types of locality:
- Temporal locality: recently accessed items
- Spatial locality: e.g. the next element in an array

A cache can hold copies of data or instructions from main memory, and is linked closer (and hence faster) to the CPU. A cache block is the minimum unit of data a cache can hold, normally 32 to 128 bytes; sometimes called a cache line. Each cache line has an index. Data with certain memory addresses are placed in certain cache lines. Some data maps onto the same cache line; only one of these can be cached.

Design decisions for a cache:
- When should a copy be made?
- Where to place a block?
- How to find a block?
- Which block to replace after a miss?
- What happens on a memory write?
Methods for solving these must be simple, i.e. cheap and easy.

When to cache? Always cache on reads: if a miss occurs, cache the value.

Where to cache? Cache blocks have a number; use this to determine where. Two schemes:
- Direct mapped: ignore the last n bits of the address, where 2^n is the block size. The block number (index) is the remaining bits MOD the number of blocks in the cache.
- Set associativity: the cache is divided into sets; use the number of sets instead of the number of blocks to get the index. E.g. a 32 KB cache could have 4 8 KB sets (4-way associative). If an index is full, there are three more spaces.

How to find a cache block? When an address is loaded, check the cache: find the correct index, check each block's address tag, and check the address's valid bit. Repeat for all sets in the index.

Which block to replace? In direct mapped there is no choice. In sets, can replace:
- Random
- Least recently used (LRU): better, but harder

What happens on a write? Two basic strategies:
- Write through: write to cache and main memory. Main memory is always up to date, useful for multiprocessors/IO. Uses more bandwidth.
- Write back: write to cache only; write to main memory when the block is replaced. Needs a dirty/clean bit.
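The direct-mapped placement rule above is just integer arithmetic on the address. A sketch (cache parameters assumed: 32-byte blocks, 1024 blocks, i.e. a 32 KB cache):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: 32-byte blocks, 1024 blocks => 32 KB direct-mapped cache. */
enum { BLOCK_SIZE = 32, NUM_BLOCKS = 1024 };

/* Low bits: position within the block. */
uint32_t block_offset(uint32_t addr) { return addr % BLOCK_SIZE; }

/* Middle bits: which cache line the block must go in. */
uint32_t block_index(uint32_t addr)  { return (addr / BLOCK_SIZE) % NUM_BLOCKS; }

/* Remaining high bits: the tag stored with the line to identify the block. */
uint32_t tag(uint32_t addr)          { return addr / (BLOCK_SIZE * NUM_BLOCKS); }
```

Two addresses exactly one cache size (32 KB) apart get the same index but different tags, which is exactly the conflict case where only one of them can be cached.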
Write back reduces bandwidth, but is harder to implement.

Cache performance: average access time = hit time + miss ratio x miss time. Try to minimise all three.

(Figure: an address splits into a block index and a block offset.)

Cache misses can be divided into three types:
- Compulsory (or cold start): the first ever access
- Capacity: the cache is too small for the data
- Conflict: misses caused by data mapping onto the same block

Block size is a trade-off: a larger block size results in fewer misses (exploits spatial locality), but larger blocks have a higher miss time and can cause additional capacity and conflict misses. Having more sets reduces conflicts; 8 is a good number, but this increases hit time. Could also use a victim cache to reduce conflicts: a small buffer to store the most recently ejected blocks.

Prefetching loads the data into cache before the load is issued. Hardware prefetching is simple: fetch the block next to the one accessed (spatial locality). Compilers can place a prefetch instruction before the actual load.

To reduce miss time, use multiple levels of cache. The L2 cache should/will be:
- Larger than L1, otherwise it will still get misses
- Slower than L1
- Of larger block size
- Coherent with L1

Lecture 4: Alternatives to Caching

Problems with caches:
- Caches rely on locality: scientific code has poor temporal locality
- Cache performance is hard to predict: can't let the compiler help
- Coherency on multiprocessor machines needs complex hardware

Three main alternatives to caches:
- Vector processors: pipeline loads, like prefetching
- Hide memory latency by using a multi-threaded system: needs high-speed thread switching (Cray MTA)
- Decoupled processors: divide the processor into:
  - Control unit: control flow, issues instructions. Runs ahead of the other units, forming a huge pipeline
  - Address unit: address arithmetic, loads/stores. Runs ahead of the data unit
  - Data unit: arithmetic
  Only example: ACRI, funded by the EC, never built.

Lecture 5: Memory

Two technologies:
- DRAM: cheaper, slower, needs to refresh data
- SRAM: faster, expensive, no refresh; used for caches

Cannot improve latency, so improve bandwidth; this allows loading of whole L2 cache lines. Memory is organised in banks: can read/write concurrently from different banks (interleaving). Accessing data in the same bank is a problem; a prime number of banks, or lots of banks, reduces this.

Virtual memory maps memory to disk; think of it like a big cache.
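The "big cache" analogy can be made concrete with a translation sketch (the one-level table, 4 KB page size and table contents here are assumptions for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed: 4 KB pages and a tiny one-level page table.
 * page_table[v] holds the physical frame number for virtual page v. */
enum { PAGE_SIZE = 4096, NUM_PAGES = 16 };

uint32_t translate(const uint32_t page_table[], uint32_t vaddr) {
    uint32_t vpage  = vaddr / PAGE_SIZE;  /* "index" of the big cache */
    uint32_t offset = vaddr % PAGE_SIZE;  /* position within the page */
    return page_table[vpage] * PAGE_SIZE + offset;
}
```

As with a cache block, only the page-sized chunk moves between levels; the offset within the page is unchanged by translation.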
Blocks are pages. The mapping from virtual to physical addresses is stored in a page table in memory. Virtual memory acts like a fully associative cache with LRU replacement and write back.

Lecture 6: Cache Coherency

A major problem in multiprocessor machines. A memory system is coherent if:
- When a processor writes a memory location X and subsequently reads it, it sees the value it wrote, provided no other writes have taken place at X
- When processor P1 writes a memory location X and processor P2 subsequently reads it, P2 sees the value written by P1, provided the accesses have occurred with sufficient time between them
- If processor P1 writes value V1 to X, then subsequently writes V2 to X, no processor can read X and see V1 once it has seen V2

Coherence and consistency are different, but complementary. Caches need to share information on sharing status: is this cache block stored elsewhere? There are two main methods:
- Directory based
- Snooping (or broadcast) based: example below

There are also two methods for coherency:
- Write invalidate: all other copies are invalidated when a write takes place

  Processor activity | Bus activity | Memory | P1 cache | P2 cache
  (initial)          |              | 11     | -        | -
  P1 read            | Miss         | 11     | 11       | -
  P2 read            | Miss         | 11     | 11       | 11
  P1 write 25        | Invalidate   | 11     | 25       | -
  P2 read            | Miss         | 25     | 25       | 25

- Write update: all other copies are updated when a write takes place

  Processor activity | Bus activity | Memory | P1 cache | P2 cache
  (initial)          |              | 11     | -        | -
  P1 read            | Miss         | 11     | 11       | -
  P2 read            | Miss         | 11     | 11       | 11
  P1 write 25        | Broadcast    | 25     | 25       | 25
  P2 read            |              | 25     | 25       | 25

Invalidate requires less bandwidth, but causes more cache misses; it is the choice for most systems. Use the cache valid tag for the invalidate. Need an extra tag to indicate sharing status; can use clean/dirty in write back caches. All processors monitor bus traffic. Each cache block exists in one of three states:
- Exclusive: read/write access
- Shared: read only
- Invalid: do not use

For a directory based system there is no bus. Use a bit vector for every block, one bit per processor, stored in distributed memory. Still have the three states as in snooping. Data has a home; the cache directory records this and can write back to it. For CC-NUMA machines, migrate data using software.

Tutorial: Memory Consistency Models

A memory consistency model consists of rules determining the order in which writes by one processor must be observed by other processors (or other devices). Models have release and acquire operations to synchronise memory accesses. The simplest model is sequential consistency: the result is the same as though the program ran sequentially, implemented by making sure writes can only occur when all other writes have been completed. Very simple, but overkill. Processor consistency, or total store ordering, allows reads to occur before writes have completed.
Partial store ordering allows writes to different addresses to occur out of order. Weak ordering allows reads and writes to occur in any order. Release consistency allows some reads/writes to occur outside the acquire/release.

Lecture 7: Interconnects

Networks span a range of scales: WAN, LAN, NOW, MPP.
Networks can be classified by:
- Serial vs. parallel: how many bits can be sent at once
- Synchronous vs. asynchronous
- Active vs. passive: do messages perform some processing?
- Circuit switched vs. packet switched: send the whole message, or send it in pieces
- Uni- or bi-directional

Can access the network with special addresses, or with instructions on the processor. Networks have two key characteristics: bandwidth and latency.

Network topologies are described with a graph: machines are nodes, connections are edges. Topologies can be fixed or dynamic. Various metrics describe a topology:
- Diameter: maximum number of edges traversed between any two nodes
- Degree: number of links connected to each node
- Bisection width: minimum number of links between any bi-partition of the nodes
- Bisection bandwidth: minimum total bandwidth between any bi-partition of the nodes
- Point-to-point bandwidth: minimum bandwidth between any two nodes
- Latency: maximum transfer time between any two nodes

(Figure: some static topologies.)

A message can specify the route, or just the destination. Routing can be deterministic (same every time) or adaptive (changes), and minimal (takes the shortest path) or non-minimal. XY routing is a simple algorithm: go in the X direction until you reach the destination's x co-ordinate, then go in the Y direction. This is deterministic and minimal.

If a message is split into packets:
- Store and forward: packets traverse one link at a time; the first packet waits for the last packet
- Cut-through: different packets can take different paths. Packets catch up, but do not overtake
- Wormhole: as above, but packets do not catch up

Buses are a simple dynamic topology: cheap broadcasts, but poor scaling. Can have hierarchical buses or multiple buses. A crossbar is a general switch: any input can be connected to any output.
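XY routing on a 2-D mesh can be sketched directly from its description (the coordinate representation is an assumption; the algorithm is the one in the notes):

```c
#include <assert.h>

/* Deterministic, minimal XY routing: walk along X until the destination
 * column is reached, then along Y.  Returns the hop count, which equals
 * the Manhattan distance (hence minimal). */
int xy_hops(int sx, int sy, int dx, int dy) {
    int hops = 0;
    while (sx != dx) { sx += (dx > sx) ? 1 : -1; hops++; } /* X phase first */
    while (sy != dy) { sy += (dy > sy) ? 1 : -1; hops++; } /* then Y phase */
    return hops;
}
```

Because every message between the same pair of nodes takes the same path, XY routing is deterministic; the price is that it cannot adapt around congested links.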
Other networks are:
- Omega: each switch is a 2x2 crossbar; not all paths can be used at once
- Baseline: as Omega, but can be expanded recursively
- Clos: switches are m x m and k x k crossbars. Can be recursive

Lecture 9: Operating Systems

Any OS has three main functions: hardware abstraction, resource management and the user interface. Resource management is the main influence on performance. It carries out three basic functions: multiplexing, locking and access control.

A kernel is the part of the OS responsible for hardware abstraction; applications access the system via the kernel. Only one program can access the CPU at any one time, so the OS must schedule the programs using:
- Round robin: when a process pauses, it goes to the back of the queue
- Prioritised round robin: each program has a priority associated with it; higher priority means it gets first choice

Priorities can be adjusted dynamically, which prevents starvation of low priority tasks. If there is more than one processor, run several programs simultaneously, trying to maintain each program's affinity with a processor. Each program may have multiple threads: should the OS be fair to programs or threads? Most OSs treat threads as lightweight processes, which, unlike user-level threads, can run on different CPUs.

Lecture 10: Compiler Architecture

A compiler turns source code into machine code. A good compiler should:
- Be bug free
- Generate correct machine code
- Generate fast machine code
- Produce consistent and predictable optimisations
- Generate debuggable machine code
- Be quick to compile
- Give good error messages
- Be modular, to cope with multiple IRs

The front end of the compiler produces an intermediate representation:
- Lexical analyser: turns ASCII code into a stream of tokens, stripping out whitespace and comments:
  x=x*(b+1)  ->  id(x) = id(x) * ( id(b) + num(1) )
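The lexical-analysis step shown above can be sketched as a toy scanner (a hand-rolled illustration; real lexers are table- or regex-driven, and this one does not handle comments or multi-character operators):

```c
#include <ctype.h>
#include <string.h>

/* Turn "x=x*(b+1)" into a token stream like
 * "id(x) = id(x) * ( id(b) + num(1) ) " (one trailing space per token). */
void tokenize(const char *src, char *out) {
    out[0] = '\0';
    for (const char *p = src; *p; ) {
        char tok[32];
        if (isspace((unsigned char)*p)) { p++; continue; }  /* strip whitespace */
        if (isalpha((unsigned char)*p)) {                   /* identifier */
            int n = 0;
            while (isalnum((unsigned char)*p)) tok[n++] = *p++;
            tok[n] = '\0';
            strcat(out, "id("); strcat(out, tok); strcat(out, ") ");
        } else if (isdigit((unsigned char)*p)) {            /* number literal */
            int n = 0;
            while (isdigit((unsigned char)*p)) tok[n++] = *p++;
            tok[n] = '\0';
            strcat(out, "num("); strcat(out, tok); strcat(out, ") ");
        } else {                                            /* single-char operator */
            tok[0] = *p++; tok[1] = ' '; tok[2] = '\0';
            strcat(out, tok);
        }
    }
}
```

Note how the character-level details (spacing, digits vs. letters) disappear here, so the parser that follows only ever deals in whole tokens.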
- Parser: converts the token stream into a parse tree and detects any grammatical/syntax errors
- Semantic analyser: uses the parse tree to check for semantic errors, i.e. type checking
- IR generator: produces a language which can represent many high-level languages. The IR is specific to the compiler; it can include loops/arrays, or be more low-level:
  t1 := b+1
  x := x*t1

The back end of a compiler contains:
- IR optimiser: performs language independent optimisations
- Code generator: turns IR into assembly code; not a trivial process
- Low-level optimiser: performs machine specific optimisations

For Java, the compiling process produces byte code. The code is then run as interpreted code, or compiled whilst running (JIT). JIT has the functionality of a back-end compiler.

Lecture 11: Optimisation I

Basic optimisations:
- Constant folding: replace any constant expression with its value
- Algebraic simplifications: make use of associativity and commutativity
- Copy propagation: use the most recent copy of a value; reduces the number of registers used:
  x=y; c=x+3; d=x+y  ->  x=y; c=y+3; d=y+y
- Constant propagation: if a variable is assigned a constant, replace the variable with the constant until it changes
- Redundancy elimination: assign expressions to temporary values if they are repeated elsewhere; move expressions out of loops if possible; remove dead code

Simple loop optimisations:
- Strength reduction: remove computation on an induction variable and replace it with a loop-index dependent variable
- Remove any induction variables
- Move array bounds checking to outside the loop (in Java and Perl)

Procedure call optimisations: inline procedures; expansion is done at assembly code level.

Lecture 12: Optimisation II

The compiler tries to minimise the number of instructions executed, minimise the number of pipeline stalls, and enable multiple issue (VLIW or superscalar). The compiler also tries to find instructions to put in the branch delay slot. Loop unrolling increases the size of the loop body to increase the chances of instruction rescheduling. Need to clean up after loop unrolling.
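Loop unrolling and its clean-up can be sketched in C (the 4-way unroll factor and the element-wise add are chosen arbitrarily for illustration):

```c
/* 4-way unrolled a[i] += b[i], with a clean-up loop for the
 * leftover n % 4 iterations that the unrolled body cannot cover. */
void add_unrolled(float *a, const float *b, int n) {
    int i = 0;
    for (; i + 3 < n; i += 4) {   /* unrolled body: 4 independent adds, */
        a[i]     += b[i];         /* giving the scheduler more to reorder */
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }
    for (; i < n; i++)            /* clean-up: the "clean up after unrolling" */
        a[i] += b[i];
}
```

The unrolled body amortises the loop's branch over four iterations and exposes four independent adds for multiple issue; the clean-up loop is the price of an n that is not a multiple of the unroll factor.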
Loops cannot have any complex branch controls. Variable expansion can help break dependencies. Register names can be reused to increase the amount of possible rescheduling.

Software pipelining needs a prologue and an epilogue:

for (i=0;i<n;i++) {
    a(i) += b(i);
}

expands to

for (i=0;i<n;i++) {
    t1 = a(i);
    t2 = b + t1;
    a(i) = t2;
}

and is software pipelined as

//prologue
t1 = a(0); t2 = b + t1; t1 = a(1);
for (i=0;i<n-2;i++) {
    a(i) = t2;
    t2 = b + t1;
    t1 = a(i+2);
}
//epilogue
a(n-2) = t2; t2 = b + t1;
a(n-1) = t2;

This is done at the instruction level. Register allocation is done using graph colouring; try to avoid spilling. A normal strategy is:
1. Software pipelining with unrolling
2. Basic block and branch scheduling
3. Register allocation
4. Basic block and branch scheduling again

Lecture 13: Optimisation III

Some optimisations are not always done by a compiler.

Array ordering and loop transformations:
- Running a loop backwards may improve temporal locality if you've just run it forwards
- Loop skewing may enable more transforms, i.e. use for (i=j;i<m+j;i++) instead of referring to location (j+i)
- Fuse loops together that have the same iteration space
- Distribute loops (the opposite of fusion): may reduce register conflicts
- Loop tiling can make sure loop segments fit into cache:

for (i=0;i<n;i++){
  for (j=0;j<n;j++) {
    a[i][j] += b[i][j];
  }
}

becomes

for (ii=0;ii<n;ii+=b){
  for (jj=0;jj<n;jj+=b){
    for (i=ii;i<ii+b;i++){
      for (j=jj;j<jj+b;j++){
        a[i][j] += b[i][j];
      }
    }
  }
}

Tiling is most effective when the data is reused within a loop. Padding arrays with extra elements (fewer than a cache line's worth) will reduce conflict misses.
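The arithmetic behind padding can be sketched as follows (the cache geometry and array width are assumptions chosen so the conflict is visible): in a direct-mapped cache with 256 sets of 32-byte lines (8 KB), rows of a 2048-float array are 8192 bytes apart, so every element of one column maps to the same set; padding each row by a few elements spreads the column across sets.

```c
#include <stdint.h>

/* Assumed cache geometry: 32-byte lines, 256 sets (8 KB direct mapped). */
enum { LINE = 32, SETS = 256 };

/* Which cache set a byte address falls in. */
unsigned set_of(unsigned byte_addr) { return (byte_addr / LINE) % SETS; }

/* Byte address of a[row][col] in a row-major float array whose rows
 * hold row_len elements (row_len > n gives padding). */
unsigned elem_addr(int row, int col, int row_len) {
    return (unsigned)(row * row_len + col) * sizeof(float);
}
```

With row_len = 2048 (8192 bytes, an exact multiple of the cache size), a column walk hits one set over and over; with row_len = 2056 (8 padding floats, still under one line of 8 floats per row pair is not required, any small pad works) consecutive rows land in different sets.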