A Generic and Compositional Framework for Multicore Response Time Analysis


Sebastian Altmeyer (University of Luxembourg, University of Amsterdam)
Claire Maiza (Grenoble INP Verimag)
Robert I. Davis (University of York; INRIA, Paris-Rocquencourt)
Vincent Nelis (CISTER, ISEP, Porto)
Leandro Indrusiak (University of York)
Jan Reineke (Saarland University)

ABSTRACT

In this paper, we introduce a Multicore Response Time Analysis (MRTA) framework. This framework is extensible to different multicore architectures, with various types and arrangements of local memory, and different arbitration policies for the common interconnects. We instantiate the framework for single level local data and instruction memories (cache or scratchpads), for a variety of memory bus arbitration policies, including: Round-Robin, FIFO, Fixed-Priority, Processor-Priority, and TDMA, and account for DRAM refreshes. The MRTA framework provides a general approach to timing verification for multicore systems that is parametric in the hardware configuration and so can be used at the architectural design stage to compare the guaranteed levels of performance that can be obtained with different hardware configurations. The MRTA framework decouples response time analysis from a reliance on context independent WCET values. Instead, the analysis formulates response times directly from the demands on different hardware resources.

1. INTRODUCTION

Effective analysis of the worst-case timing behaviour of systems built on multicore architectures is essential if these high performance platforms are to be deployed in critical real-time embedded systems used in the automotive and aerospace industries. We identify four different approaches to solving the problem of determining timing correctness.

With single core systems, a traditional two-step approach is typically used. This consists of timing analysis which determines the context-independent worst-case execution time (WCET) of each task, followed by schedulability analysis, which uses task WCETs and information about the processor scheduling policy to determine if each task can be guaranteed to meet its deadline.
When local memory (e.g. cache) is present, then this approach can be augmented by analysis of Cache Related Pre-emption Delays (CRPD) [4], or by partitioning the cache to avoid CRPD altogether. Both approaches are effective and result in tight upper bounds on task response times [5].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. RTNS 2015, November 04-06, 2015, Lille, France. © 2015 Copyright held by the owner/author(s). Publication rights licensed to ACM. ISBN /15/11... $15.00 DOI:

With a multicore system, the situation is more complex since WCETs are strongly dependent on the amount of cross-core interference on shared hardware resources such as main memory, L2-caches, and common interconnects, due to tasks running on other cores. The uncertainty and variability in this cross-core interference renders the traditional two-step process ineffective for many multicore processors. For example, on the Freescale P4080, the latency of a read operation varies from 40 to 600 cycles depending on the total number of cores running and the number of competing tasks [32]. Similarly, a 14 times slowdown has been reported [35] due to interference on the L2-cache for tasks running on Intel Core 2 Quad processors.

At the other extreme is a fully integrated approach.
This involves considering the precise interleaving of instructions originating from different cores [19]; however, such an approach suffers from potentially insurmountable problems of combinatorial complexity, due to the proliferation of different path combinations, as well as different release times and schedules.

An alternative approach is based on temporal isolation [14]. The idea here is to statically partition the use of shared resources, e.g. space partitioning of cache and DRAM banks, time partitioning of bus access, so that context-independent WCET values can be used and the traditional two-step process applied. This approach raises a further challenge: how to partition the resources to obtain schedulability [36]. Techniques which seek to limit the worst-case cross-core interference, for example by using TDMA arbitration on the memory bus or by limiting the amount of contention by suspending execution on certain cores [32], can have a significant detrimental effect on performance, effectively negating the performance benefits of using a multicore system altogether. We note that TDMA is rarely if ever used as a bus-arbitration policy in real multicore processors, since it is not work-conserving and so wastes significant bandwidth. This impacts both worst-case and average-case performance, the latter being essential for application areas such as telecommunications, which have a major influence on processor design.

The final approach is the one presented in this paper, based on explicit interference modelling. We explore the premise that due to the strong interdependencies between timing analysis and schedulability analysis on multicore systems, they need to be considered together. In our approach, we omit the notion of WCET per se and instead directly target the calculation of task response times.

In this work, we use execution traces to model the behaviour of tasks. Traces provide a simple yet expressive way to model task behaviour.
Note that relying on execution traces does not pose a fundamental limitation to our approach as all required performance quantities can also be derived using static analysis [28, 17, 1] as within the traditional context-independent timing analysis; however, traces enable a near-trivial static cache analysis and so allow us to focus on response time analysis.

The main performance metrics are the processor demand and the memory demand of each task. The latter quantity feeds into analysis of the arbitration policy used by the common interconnect, enabling us to upper bound the total memory access delays which may occur during the response time of the task. By computing the overall processor demand and memory demand over a relatively long interval of time (i.e. the task response time), as opposed to summing the worst case over many short intervals (e.g. individual memory accesses), we are able to obtain much tighter response time bounds.

The Multicore Response Time Analysis framework (MRTA) that we present is extensible to different types and arrangements of local memory, and different arbitration policies for the common interconnect. In this paper, we instantiate the MRTA framework assuming the local memories used for instructions and data are single-level and either cache, scratchpad, or not present. Further, we assume that the memory bus arbitration policy may be TDMA, FIFO, Round-Robin, Fixed-Priority (based on task priorities), or Processor-Priority. We also account for the effects of DRAM refresh [6, 11].

The general approach embodied in the MRTA framework is extensible to more complex, multi-level memory hierarchies, and other sources of interference. It provides a general timing verification framework that is parametric in the hardware configuration (common interconnect, local memories, number of cores etc.) and so can be used at the architectural design stage to compare the guaranteed levels of performance that can be obtained with different hardware configurations, and also during the development and integration stages to verify the timing behaviour of a specific system. While the specific hardware models and their mathematical representations used in this paper cannot capture all of the interference and complexity of actual hardware, they serve as a valid starting point.
They include the dominant sources of interference and represent current architectures reasonably well.

2. RELATED WORK

In 2007, Rosen et al. [37] proposed an implementation in which TDMA slots on the bus are statically allocated to cores. This technique relies on the availability of a user-programmable table-driven bus arbiter, which is typically not available in real hardware, and on knowledge at design time of the characteristics of the entire workload that executes on each core. Chattopadhyay et al. [15] and Kelter et al. [22] proposed an analysis which takes into account a shared bus and instruction cache, assuming separate buses and memories for both code and data (uncommon in real hardware) and TDMA bus arbitration. The method has a limited applicability as it does not address data accesses to memory.

In 2010, Schranzhofer et al. [39] developed a framework for analysing the worst-case response time of real-time tasks on a multicore with TDMA arbitration. This was followed by work on resource adaptive arbiters [40]. They proposed a task model in which tasks consist of sequences of super-blocks, themselves divided into phases that represent implicit communication (fetching or writing of data from/to memory), computation (processing the data), or both. Contrary to the technique presented here, their approach requires major program intervention and compiler assistance to prefetch data.

Also in 2010, Lv et al. [30] proposed a method to model request patterns and the memory bus using timed automata. Their method handles instruction accesses only and may suffer from state-space explosion when applied to data accesses. A method employing timed automata was proposed by Gustavsson et al. [19] in which the WCET is obtained by proving special predicates through model checking. This approach allows for a detailed system modelling but is also prone to the state-space explosion problem.

In 2014, Kelter et al. [23] analysed the maximum bus arbitration delays for multiprocessor systems sharing a TDMA bus and using both (private) L1 and (shared) L2 instruction and data caches.
Pellizzoni et al. [34] compute an upper bound on the contention delay incurred by periodic tasks, for systems comprising any number of cores and peripheral buses sharing a single main memory. Their method does not cater for non-periodic tasks and does not apply to systems with shared caches. In addition it relies on accurate profiling of cache utilization, suitable assignment of the TDMA time-slots to the tasks' super-blocks, and imposes a restriction on where the tasks can be pre-empted.

Schliecker et al. [38] proposed a method that employs a general event-based model to estimate the maximum load on a shared resource. This approach makes very few assumptions about the task model and is thus quite generally applicable. However, it only supports a single unspecified work-conserving bus arbiter.

Paolieri et al. [33] proposed a hardware platform that enforces a constant upper bound on the latency of each access to a shared resource. This approach enables the analysis of tasks in isolation since the interference on other tasks can be conservatively accounted for using this bound. Similarly, the PTARM [29] enforces constant latencies for all instructions, including loads and stores. However, both cases represent customized hardware.

Kim et al. [24] presented a model to upper bound the memory interference delay caused by concurrent accesses to a shared DRAM main memory. Their work differs from this paper in that they do not assume a unique shared bus to access the main memory and they primarily focus on the contention at the DRAM controller by assuming a fully partitioned private and shared cache model. (For shared caches they assume that the extra number of requests generated due to cache line evictions at runtime is given.)

Yun et al. [42] proposed a software-based memory throttling mechanism to explicitly limit the memory request rate of each core and thereby control the memory interference. They also developed analytical solutions to compute proper throttling parameters that satisfy schedulability of critical tasks while minimising the performance impact of throttling.
In 2015, Dasari et al. [16] proposed a general framework to compute the maximum interference caused by the shared memory bus and its impact on the execution time of the tasks running on the cores. The method in [16] is more complex than that proposed in this paper, and may be more accurate when it estimates the delay due to the shared bus, but it does not take cache-related effects into account (by assuming partitioned caches), which makes it less generic than the framework proposed here.

Regarding shared caches, Yan and Zhang [41] addressed the problem of computing the WCET of tasks assuming direct-mapped, shared L2 instruction caches on multicores. The applicability of the approach is unfortunately limited as it makes very restrictive assumptions such as (1) data caches are perfect, i.e. all accesses are hits, and (2) data references from different threads will not interfere with each other in the shared L2 cache. Li et al. [27] proposed a method to estimate the worst-case response time of concurrent programs running on multicores with shared L2 caches, assuming set-associative instruction caches using the LRU replacement policy. Their work was later extended [15] by adding a TDMA bus analysis technique to bound the memory access delay.

3. SYSTEM MODEL

In this paper, we provide a theoretical framework that can be instantiated for a range of different multicore architectures with different types of memory hierarchy and different arbitration policies for the common interconnect. Our aim is to create a flexible, adaptable, and generic analysis framework wherein a large number of common multicore architecture designs can be modeled and analysed. In this paper inevitably we can only cover a limited number of types of local memory, bus, and global memory behaviour. We select common approaches to model the different hardware components and integrate them into an extensible framework.

3.1 Multicore Architectural Model

We model a generic multicore platform with $\ell$ timing-compositional cores $P_1, \ldots, P_\ell$ as depicted in Figure 1. By timing-compositional cores we mean cores where it is safe to separately account for delays from different sources, such as computation on a given core and interference on a shared bus [20]. The set of cores is defined as $\mathbb{P}$. Each core has a local memory which is connected via a shared bus to a global memory and IO interface.

We assume constant delays $d_{main}$ to retrieve data from global memory under the assumption of an immediate bus access, i.e., no wait-cycles or contention on the bus. We assume atomic bus transactions, i.e., no split transactions, which furthermore are not re-ordered, and non-preemptable busy waiting on the processor for requests to be serviced. Further, we assume that bus access may be given to cores for one access at a time. The types of the memories and the bus policy are parameters that can be instantiated to model different multicore systems. In this paper, we omit a consideration of delays due to cache coherence and synchronization, and we assume write-through caches only. Write-back caches are discussed in the technical report [2].

Figure 1: Multicore Platform. A set of $\ell$ processors with local memories connected via a common bus to a global memory and IO interface.

3.2 Task Model

We assume a set of $n$ sporadic tasks $\{\tau_1, \ldots, \tau_n\}$; each task $\tau_i$ has a minimum period or inter-arrival time $T_i$ and a deadline $D_i$. Deadlines are assumed to be constrained, hence $D_i \leq T_i$. We assume that the tasks are statically partitioned to the set of $\ell$ identical cores $\{P_1, \ldots, P_\ell\}$, and scheduled on each core using fixed-priority pre-emptive scheduling. The set of tasks assigned to core $P_x$ is denoted by $\Gamma_x$. The index of each task is unique and thus provides a global priority order, with $\tau_1$ having the highest priority and $\tau_n$ the lowest.
The global priority of each task translates to a local priority order on each core which is used for scheduling purposes. We use $hp(i)$ ($lp(i)$) to denote the set of tasks with higher (lower) priority than that of task $\tau_i$, and we use $hep(i)$ ($lep(i)$) to denote the set of tasks with higher or equal (lower or equal) priority to task $\tau_i$. We initially assume that the tasks are independent, in so far as they do not share mutually exclusive software resources (discussed in the technical report [2]); nevertheless, the tasks compete for hardware resources such as the processor, memory, and the bus.

The execution of task $\tau_i$ is modelled using a set of traces $O_i$, where each trace $o = [\iota_1, \ldots, \iota_k]$ is an ordered list of instructions. For ease of notation, we treat the ordered list of instructions as a multi-set, whenever we can abstract away from the specific order. We distinguish three types of instructions $t$:

\[ t = \begin{cases} r[m^{da}] & \text{read data from memory block } m^{da} \\ w[m^{da}] & \text{write data to memory block } m^{da} \\ e & \text{execute} \end{cases} \tag{1} \]

An instruction $\iota$ is a triple consisting of the instruction's memory address $m^{in}$, its execution time $\Delta$ without memory delays, i.e., assuming a perfect local memory, and the instruction type $t$:

\[ \iota = (m^{in}, \Delta, t) \tag{2} \]

The set of memory blocks is defined as $M$. $M^{da}$ denotes the data memory blocks, $M^{in}$ the instruction memory blocks. We assume that data and instruction memory are disjoint, i.e., $M^{in} \cap M^{da} = \emptyset$.

The use of traces to model a task's behaviour is unusual as the number of traces is exponential in the number of control-flow branches. Despite this obvious drawback, traces provide a simple yet expressive way to model task behaviour. They enable a near-trivial static cache analysis and a simple multicore simulation to evaluate the accuracy of the timing verification framework. However, most importantly, traces show that the worst-case execution behaviour of a task $\tau_i$ on a multicore system is not uniquely defined.
From the viewpoint of a task scheduled on the same core, $\tau_i$ may have the highest impact when it uses the core for the longest possible time interval, whereas the impact on tasks scheduled on any other core may be maximized when $\tau_i$ produces the largest number of bus accesses. These two cases may well correspond to different execution traces.

As a remedy for the exponential number of traces, the complexity can be reduced by (i) computing a synthetic worst-case trace or (ii) by deriving the set of Pareto optimal traces that maximize the task's impact according to a pre-defined cost function (see [28]). We can also completely resort to static analysis to derive upper bounds on the performance metrics. Static analyses provide independent upper bounds on the different performance quantities. This strongly reduces the computational complexity, but may lead to pessimism. An evaluation of this trade-off is future work.

4. MEMORY MODELLING

In this section we show how the effects of a local memory can be modelled via a MEM function which describes the number of accesses due to a task which are passed to the next level of the memory hierarchy, in this case main memory. The MEM function is instantiated for both cache and scratchpads.

We model the effect of a (local) memory using a function of the form:

\[ \mathrm{MEM}: O \rightarrow \mathbb{N} \times 2^{2^{\mathbb{N}}} \times 2^{\mathbb{N}} \tag{3} \]

where $\mathrm{MEM}(o) = (MD_o, UCB_o, ECB_o)$ computes, for a trace $o$:

- the number of bus accesses, i.e., the number of memory accesses which cannot be served by the local memory alone (denoted as memory demand $MD$),
- $UCB_o$, a multiset containing, for each program point in trace $o$, the set of Useful Cache Blocks (UCBs) [25], which may need to be reloaded when trace $o$ is pre-empted at that program point, and
- the set of Evicting Cache Blocks (ECBs), which is the set of all cache blocks accessed by trace $o$ which may evict memory blocks of other tasks.

The value $MD$ does not just cover cache misses, but also has to account for write accesses.
In the case of write-through caches, each write access will cause a bus access, irrespective of whether or not the memory block is present in cache. The number of bus accesses $MD$ assumes non-preemptive execution. With pre-emptive execution and caches, more than $MD$ memory accesses can contribute to the bus contention due to cache eviction. In this paper, we make use of the CRPD analysis for fixed-priority pre-emptive scheduling introduced in [4].

We now derive instantiations of the function $\mathrm{MEM}(o)$ for a trace $o = [\iota_1, \ldots, \iota_k]$ for instruction memories and data memories for systems (i) without cache, (ii) with scratchpads, and (iii) with direct-mapped or LRU caches. In the following, the superscripts indicate data ($da$) or instruction memory ($in$), the subscripts the type of memory, i.e., uncached ($nc$), scratchpad ($sp$), or caches ($ca$).

4.1 Uncached

Considering instruction memory, the number of bus accesses for a system with no cache is given by the number of instructions $k$ in the trace. The sets of UCBs and ECBs are empty. Pre-emption has no effect on the local memory, since none exists.

\[ \mathrm{MEM}^{in}_{nc}(o) = (k, \emptyset, \emptyset) \tag{4} \]

Considering data memory, we have to account for the number of data accesses, irrespective of read or write access. The number of accesses is thus equal to the number of data access instructions.

\[ \mathrm{MEM}^{da}_{nc}(o) = \left( \left| \{ \iota \mid \iota \in o \wedge \iota = (\_, \_, r/w[m^{da}]) \} \right|, \emptyset, \emptyset \right) \tag{5} \]

4.2 Scratchpads

A scratchpad memory is defined using a function $\mathrm{SPM}: M \rightarrow \{true, false\}$, which returns true for memory blocks that are stored in the scratchpad. For ease of presentation, we assume a static write-through scratchpad configuration, which does not change at runtime. An extension to dynamic scratchpads and the write-back policy is straightforward, but beyond the scope of this paper. Each memory access to a memory block which is not stored in the scratchpad causes an additional bus access.

\[ \mathrm{MEM}^{in}_{sp}(o) = \left( \left| \{ m^{in} \mid (m^{in}, \_, \_) \in o \wedge \neg \mathrm{SPM}(m^{in}) \} \right|, \emptyset, \emptyset \right) \tag{6} \]

Further, in the case of write accesses, even if a memory block is stored in the scratchpad, that access also contributes to the bus contention as we assume a write-through policy.

\[ \mathrm{MEM}^{da}_{sp}(o) = \left( \left| \{ m^{da} \mid \left( (\_, \_, r[m^{da}]) \in o \wedge \neg \mathrm{SPM}(m^{da}) \right) \vee (\_, \_, w[m^{da}]) \in o \} \right|, \emptyset, \emptyset \right) \tag{7} \]

The sets of UCBs and ECBs are empty as no pre-emption overhead is assumed with static scratchpad memory. Dynamic scratchpad management is discussed in the technical report [2].

4.3 Caches

We assume a function $\mathrm{Hit}: I \times M \rightarrow \{true, false\}$, which classifies each memory access at each instruction as a cache hit or a cache miss. This function can be derived using cache simulation of the trace starting with an empty cache or by using traditional cache analysis [17], where each unclassified memory access is considered a cache miss. This means that we upper bound the number of cache misses.
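The memory-demand component of Equations (4) to (7), and a hit/miss classification by simulation from an empty cache, can be sketched as follows. This is a minimal illustration, not the authors' tooling: the `Instr` record, the function names, and the direct-mapped mapping "block address modulo number of sets" are assumptions introduced here, and only the demand counts are computed (the UCB and ECB sets are omitted).

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    """One trace entry (m_in, delta, t); the field names are hypothetical."""
    addr: int                   # instruction memory block m^in
    delta: int                  # execution time without memory delays
    kind: str                   # "r" (read), "w" (write) or "e" (execute)
    data: Optional[int] = None  # data memory block m^da for "r" / "w"


def md_uncached(trace: List[Instr]) -> Tuple[int, int]:
    """(instruction, data) bus accesses with no local memory, Eq. (4), (5)."""
    data_accesses = sum(1 for i in trace if i.kind in ("r", "w"))
    return len(trace), data_accesses


def md_scratchpad(trace: List[Instr], spm: Set[int]) -> Tuple[int, int]:
    """(instruction, data) bus accesses with a static write-through
    scratchpad holding the blocks in `spm`, Eq. (6), (7)."""
    instr = sum(1 for i in trace if i.addr not in spm)
    # Reads go to the bus only when the block is not in the scratchpad;
    # writes always go to the bus because of the write-through policy.
    data = sum(1 for i in trace
               if i.kind == "w" or (i.kind == "r" and i.data not in spm))
    return instr, data


def direct_mapped_misses(blocks: List[int], n_sets: int) -> int:
    """Count misses of a block sequence in a direct-mapped cache that starts
    empty. Assumed mapping: cache set = block address modulo n_sets."""
    content = {}  # cache set -> memory block currently stored there
    misses = 0
    for block in blocks:
        cache_set = block % n_sets
        if content.get(cache_set) != block:
            misses += 1
            content[cache_set] = block
    return misses


trace = [Instr(0, 1, "e"), Instr(1, 1, "r", 100),
         Instr(2, 1, "w", 101), Instr(0, 1, "r", 101)]
uncached = md_uncached(trace)
scratchpad = md_scratchpad(trace, spm={0, 101})
misses = direct_mapped_misses([0, 1, 0, 4, 0], n_sets=4)
```

In the small trace above, block 101 is held in the scratchpad, so reading it costs no bus access, while writing it still does.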
For each possible pre-emption point $\iota$ on trace $o$, the set of UCBs is derived using the corresponding analysis described in Altmeyer's thesis [1], Chapter 5, Section 4. It is sufficient to only store the cache sets that useful memory blocks map to, instead of the useful memory blocks themselves. The multiset $UCB_o$ then contains, for each program point $\iota$ in trace $o$, the set of UCBs at that program point, i.e., $UCB_o = \bigcup_{\iota \in o} UCB_\iota$. The set of ECBs is the set of all cache sets of memory blocks on trace $o$.

\[ \mathrm{MEM}^{in}_{ca}(o) = \left( \left| \{ m^{in} \mid \iota = (m^{in}, \_, \_) \in o \wedge \neg \mathrm{Hit}(m^{in}, \iota) \} \right|, UCB^{in}_o, ECB^{in}_o \right) \tag{8} \]

Since we assume a write-through policy, write accesses contribute to the bus contention and have to be treated accordingly.

\[ \mathrm{MEM}^{da}_{ca}(o) = \left( \left| \{ m^{da} \mid \left( \iota = (\_, \_, r[m^{da}]) \in o \wedge \neg \mathrm{Hit}(m^{da}, \iota) \right) \vee (\_, \_, w[m^{da}]) \in o \} \right|, UCB^{da}_o, ECB^{da}_o \right) \tag{9} \]

4.4 Memory Combinations

To allow different combinations of local memories, for example scratchpad memory for instructions and an LRU cache for data, we define the combination of instruction memory $\mathrm{MEM}^{in}$ and data memory $\mathrm{MEM}^{da}$ as follows:

\[ \mathrm{MEM}(o) = \left( MD^{in}_o + MD^{da}_o,\; UCB^{in}_o \cup UCB^{da}_o,\; ECB^{in}_o \cup ECB^{da}_o \right) \tag{10} \]

with $\mathrm{MEM}^{in}(o) = (MD^{in}_o, UCB^{in}_o, ECB^{in}_o)$ being the result for the instruction memory and $\mathrm{MEM}^{da}(o) = (MD^{da}_o, UCB^{da}_o, ECB^{da}_o)$ for the data memory.

5. BUS MODELLING

In this section we show how the memory bus delays experienced by a task can be modelled via a BUS function of the form:

\[ \mathrm{BUS}: \mathbb{N} \times \mathbb{P} \times \mathbb{N} \rightarrow \mathbb{N} \tag{11} \]

where $\mathrm{BUS}(i, x, t)$ determines an upper bound on the number of bus accesses that can delay task $\tau_i$ on processor $P_x$ during a time interval of length $t$. This abstraction covers a variety of bus arbitration policies, including Round-Robin, FIFO, Fixed-Priority, and Processor-Priority, all of which are work-conserving, and also TDMA which is not work-conserving. We now introduce the mathematical representations of the delays incurred under these arbitration policies. We note that the framework is extensible to a wide variety of different policies.
The only constraint we place on instantiations of the $\mathrm{BUS}(i, x, t)$ function is that they are monotonically non-decreasing in $t$.

Let $\tau_i$ be the task of interest, and $x$ the index of the processor $P_x$ on which it executes. Other task indices are represented by $j$, $k$ etc., while $y$, $z$ are used for processor indices. Let $S^x_i(t)$ denote an upper bound on the total number of bus accesses due to $\tau_i$ and all higher priority tasks that run on processor $P_x$ during an interval of length $t$. Let $A^y_j(t)$ be an upper bound on the total number of bus accesses due to all tasks of priority $j$ or higher executing on some processor $P_y \neq P_x$ during an interval of length $t$. (Note, $j$ may not necessarily be the priority of a task allocated to processor $P_y$.)

As memory bus requests are typically non-preemptive, one lower priority memory request (see Footnote 1) may block a higher priority one, since the global, shared memory may have just received a lower priority request before the higher priority one arrives. To account for these blocking accesses, we use $L^y_j(t)$ which denotes an upper bound on the total number of bus accesses due to all tasks of priority lower than $j$ executing on some other processor $P_y \neq P_x$ during an interval of length $t$. In Section 6 we show how the values of $S^x_i(t)$, $A^y_j(t)$ and $L^y_j(t)$ are computed and explain why $S^x_i(t)$ and $A^y_j(t)$ are subtly different and hence require distinct notation.

In the following equations for the $\mathrm{BUS}(i, x, t)$ function, we account for blocking due to one non-preemptive access from lower priority tasks running on the same core $P_x$ as task $\tau_i$ (i.e. the $+1$ in the equations). A single access suffices because such blocking can only occur at the start of the priority level-$i$ (processor) busy period.

For a Fixed-Priority bus with memory accesses inheriting the priority of the task that generates them, we have:

\[ \mathrm{BUS}(i, x, t) = S^x_i(t) + \sum_{y \neq x} A^y_i(t) + \min\left( S^x_i(t), \sum_{y \neq x} L^y_i(t) \right) + 1 \tag{12} \]

Footnote 1: Here we mean priorities on the bus, which are not necessarily the same as task priorities.

The term $\min\left( S^x_i(t), \sum_{y \neq x} L^y_i(t) \right)$ upper bounds the blocking due to tasks of lower priority than $\tau_i$ running on other cores.

For a Processor-Priority bus with memory accesses inheriting the priority of the core rather than the task, we have:

\[ \mathrm{BUS}(i, x, t) = S^x_i(t) + \sum_{y \in HP(x)} A^y_n(t) + \min\left( S^x_i(t), \sum_{y \in LP(x)} A^y_n(t) \right) + 1 \tag{13} \]

where $HP(x)$ ($LP(x)$) is the set of processors with higher (lower) priority than that of $P_x$, and $n$ is the index of the task with the lowest priority. The term $A^y_n(t)$ thus captures the interference of all tasks running on processor $P_y$, independent of their priority, and the term $\min\left( S^x_i(t), \sum_{y \in LP(x)} A^y_n(t) \right)$ upper bounds the blocking due to tasks running on processors with priority lower than that of $P_x$.

For a FIFO bus, we assume that all accesses generated on the other processors may be serviced ahead of the last access of $\tau_i$, hence we have:

\[ \mathrm{BUS}(i, x, t) = S^x_i(t) + \sum_{y \neq x} A^y_n(t) + 1 \tag{14} \]

Note that accesses from other cores do not contribute blocking since we already pessimistically account for all these accesses in the summation term.

For a Round-Robin bus with a cycle consisting of an equal number of slots $v$ per processor, we have:

\[ \mathrm{BUS}(i, x, t) = S^x_i(t) + \sum_{y \neq x} \min\left( A^y_n(t), v \cdot S^x_i(t) \right) + 1 \tag{15} \]

The worst-case situation occurs when each access in $S^x_i(t)$ is delayed by each core $P_y \neq P_x$ for $v$ slots. Interference by core $P_y$ is limited to the number of accesses from core $P_y$. Again, as we already account for all accesses from all other cores, there is no separate contribution to blocking. Note that unlike TDMA, Round-Robin moves to the next slot immediately if a processor has no access pending.

For a TDMA bus with $v$ adjacent slots per core in a cycle of length $\ell \cdot v$, we have:

\[ \mathrm{BUS}(i, x, t) = S^x_i(t) + ((\ell - 1) \cdot v) \cdot S^x_i(t) + 1 \tag{16} \]

Since TDMA is not work-conserving, the worst case corresponds to each access in $S^x_i(t)$ just missing a slot for processor $P_x$ and hence having to wait at most $((\ell - 1) \cdot v + 1)$ slots to be serviced.
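The five arbitration bounds of Equations (12) to (16) can be compared side by side once the access counts have been evaluated for a fixed interval length. The following is a minimal sketch under that assumption: the function and parameter names are hypothetical, and the inputs are the already-computed values of the own-core bound S, and the per-core bounds A and L for the other cores.

```python
from typing import Sequence


def bus_fixed_priority(s_x: int, a_other: Sequence[int],
                       l_other: Sequence[int]) -> int:
    """Eq. (12): a_other / l_other hold A^y_i(t) / L^y_i(t) for each y != x."""
    return s_x + sum(a_other) + min(s_x, sum(l_other)) + 1


def bus_processor_priority(s_x: int, a_hp: Sequence[int],
                           a_lp: Sequence[int]) -> int:
    """Eq. (13): a_hp / a_lp hold A^y_n(t) for cores in HP(x) / LP(x)."""
    return s_x + sum(a_hp) + min(s_x, sum(a_lp)) + 1


def bus_fifo(s_x: int, a_other: Sequence[int]) -> int:
    """Eq. (14): every access from every other core may be served first."""
    return s_x + sum(a_other) + 1


def bus_round_robin(s_x: int, a_other: Sequence[int], v: int) -> int:
    """Eq. (15): each other core delays each own access by at most v slots,
    but never by more accesses than that core actually issues."""
    return s_x + sum(min(a, v * s_x) for a in a_other) + 1


def bus_tdma(s_x: int, cores: int, v: int) -> int:
    """Eq. (16): slots of the other cores count whether used or not."""
    return s_x + (cores - 1) * v * s_x + 1


# Illustrative numbers only: 3 cores, S = 10 own accesses,
# the two other cores issue at most 5 and 100 accesses.
fp = bus_fixed_priority(10, a_other=[5, 100], l_other=[3, 4])
pp = bus_processor_priority(10, a_hp=[5], a_lp=[100])
fifo = bus_fifo(10, a_other=[5, 100])
rr = bus_round_robin(10, a_other=[5, 100], v=2)
tdma = bus_tdma(10, cores=3, v=2)
```

With these numbers the Round-Robin bound benefits from the lightly loaded core (only 5 accesses), whereas the TDMA bound is independent of the load on the other cores.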
Effectively, there is additional interference from the $(\ell - 1) \cdot v$ slots reserved for other processors on each access, irrespective of whether these slots are used or not. As all accesses due to higher priority tasks on $P_x$ may be serviced prior to the last access of task $\tau_i$, we require $S^x_i(t)$ accesses in total to be serviced for $P_x$. Note that when $v = 1$, Equation (16) simplifies to $\mathrm{BUS}(i, x, t) = \ell \cdot S^x_i(t) + 1$.

It is interesting to note that while TDMA provides more predictable behaviour, this is at a cost of significantly worse guaranteed performance over long time intervals (e.g. the response time of a task) due to the fact that it is not work-conserving. Effectively, this means that the memory accesses of a task may suffer additional interference due to empty slots on the bus. Nevertheless, Round-Robin behaves like TDMA when all other cores create a large number of competing memory accesses.

We note that the equal number of slots per core for Round-Robin and TDMA, and the grouping of slots per core are simplifying assumptions to exemplify how TDMA and Round-Robin buses can be analysed. An analysis for more complex configurations is reserved for future work.

6. RESPONSE TIME ANALYSIS

In this section, we present the centre point of our timing verification framework: interference-aware Multicore Response Time Analysis (MRTA). This analysis integrates the processor and memory demands of the task of interest and higher priority tasks running on the same processor, including CRPD. It also accounts for the cross-core interference on the memory bus due to tasks running on the other processors.

A task set is deemed schedulable, if for each task $\tau_i$, the response time $R_i$ is less than or equal to its deadline $D_i$:

\[ \forall i: R_i \leq D_i \Rightarrow \text{schedulable} \]

The traditional response time calculation [7] [21] for fixed-priority pre-emptive scheduling on a uniprocessor is based on an upper bound on the WCET of each task $\tau_i$, denoted by $C_i$.
By contrast, our MRTA framework dissects the individual components (processor and memory demands) that contribute to the WCET bound and re-assembles them at the level of the worst-case response time. It thus avoids the over-approximation inherent in using context-independent WCET bounds.

In the following, we assume that $\tau_i$ is the task of interest whose schedulability we are checking, and $P_x$ is the processor on which it runs. Recall that there is a unique global ordering of task priorities even though the scheduling is partitioned with a fixed-priority pre-emptive scheduler on each processor.

6.1 Interference on the Core

We compute the maximal processor demand $PD_i$ for each task $\tau_i$ as follows:

\[ PD_i = \max_{o \in O_i} \sum_{(\_, \Delta, \_) \in o} \Delta \tag{17} \]

where $\Delta$ is the execution time of an instruction without memory delays. Task $\tau_i$ suffers interference $I^{PROC}(i, x, t)$ on its core $P_x$ due to tasks of higher priority running on the same core within a time interval of length $t$ starting from the critical instant:

\[ I^{PROC}(i, x, t) = \sum_{j \in \Gamma_x \wedge j \in hp(i)} \left\lceil \frac{t}{T_j} \right\rceil PD_j \tag{18} \]

6.2 Interference on the local memory

Local memory improves a task's execution time by reducing the number of accesses to main memory. The memory demand of a trace gives the number of accesses that go to main memory and hence the bus, despite the presence of the local memory. The maximal memory demand $MD_i$ of a task $\tau_i$ is defined by the maximum number of bus accesses of any of its traces:

\[ MD_i = \max_{o \in O_i} \{ MD \mid \mathrm{MEM}(o) = (MD, \_, \_) \} \tag{19} \]

Note that the maximal memory demand refers to the demand of the combined instruction and data memory as defined in Equation (10).

The memory demand $MD_i$ is derived assuming non-preemptive execution, i.e. that the task runs to completion without interference on the local memory. The sets of UCBs and ECBs are used to compute the additional overhead due to pre-emption. In the computation of this overhead, we use the sets of UCBs per trace $o$ to preserve precision,

\[ UCB_o = UCB \quad \text{with } \mathrm{MEM}(o) = (\_, UCB, \_) \tag{20} \]

and derive the maximal set of ECBs per task $\tau_i$ as the union of the ECBs on all traces:
\[ ECB_i = \bigcup_{o \in O_i} ECB_o \quad \text{with } \mathrm{MEM}(o) = (\_, \_, ECB_o) \tag{21} \]

We use $\gamma_{i,j,x}$ (with $j \in hp(i)$) to denote the overhead (additional accesses) due to a pre-emption of task $\tau_i$ by task $\tau_j$ on core $P_x$. We use the ECB-Union [3] approach as an exemplar of CRPD analysis, as it provides a reasonably precise bound on the pre-emption overhead with low complexity. Other techniques [4] [26] could also be integrated into this framework, but we omit the explanation due to space constraints. The ECB-Union approach considers the UCBs of the pre-empted task per pre-emption point and assumes that the pre-empting task $\tau_j$ has itself already been pre-empted by all tasks with higher priority on the same processor $P_x$. This nested pre-emption of the pre-empting task is represented by the union of the ECBs of all tasks with higher or equal priority than task $\tau_j$ (see [4] for a detailed description).

\[ \gamma_{i,j,x} = \max_{\substack{k \in hep(i) \cap lp(j) \\ k \in \Gamma_x}} \; \max_{o \in O_k} \; \max_{UCB_\iota \in UCB_o} \left| UCB_\iota \cap \left( \bigcup_{\substack{h \in hep(j) \\ h \in \Gamma_x}} ECB_h \right) \right| \tag{22} \]

6.3 Interference on the Bus

We now compute the number of accesses that compete for the bus during a time interval of length $t$, equating to the worst-case response time of the task of interest $\tau_i$. We use $S^x_i(t)$ to denote an upper bound on the total number of bus accesses that can occur due to tasks running on processor $P_x$ during that time. Since lower priority tasks cannot execute on $P_x$ during the response time of task $\tau_i$ (a priority level-$i$ processor busy period), the only contribution from those tasks is a single blocking access as discussed in Section 5. The maximum delay is computed assuming task $\tau_i$ is released simultaneously with all higher priority tasks that run on $P_x$, and subsequent releases of those tasks occur as soon as possible, while also assuming that the maximum possible number of pre-emptions occur.

\[ S^x_i(t) = \sum_{k \in \Gamma_x \wedge k \in hep(i)} \left\lceil \frac{t}{T_k} \right\rceil \left( MD_k + \gamma_{i,k,x} \right) \tag{23} \]

$MD_k$ denotes the memory demand of task $\tau_k$ and $\gamma_{i,k,x}$ accounts for the pre-emption costs on core $P_x$ due to jobs of task $\tau_k$.

We use $A^y_j(t)$ to denote an upper bound on the total number of bus accesses due to all tasks of priority $j$ or higher executing on processor $P_y \neq P_x$ during an interval of length $t$. A special case is $A^y_n(t)$: since $\tau_n$ is the lowest priority task, this term includes accesses due to all tasks running on processor $P_y$.
In contrast to the derivation of $S_i^x(t)$, for $A_n^y(t)$ we can make no assumptions about the synchronisation or otherwise of tasks on processor $P_y$ with respect to the release of task $\tau_i$ on processor $P_x$. The value of $A_j^y(t)$ is therefore obtained by assuming, for each task, that the first job executes as late as possible, i.e. just prior to its worst-case response time, while the next and subsequent jobs execute as early as possible. We assume that the first interfering job of a task $\tau_k$ has all of its memory accesses as late as possible during its execution, while for subsequent jobs the opposite is true, with execution and memory accesses occurring as early as possible after release of the job. This treatment is similar to the concept of carry-in interference used in the analysis of global multiprocessor fixed-priority scheduling [10], and is illustrated in Figure 2.

[Figure 2: Illustration of the carry-in interference analysis.]

The number of complete jobs of task $\tau_k$ contributing accesses in an interval of length $t$ on processor $P_y$ is given by:

\[ N_{j,k}^y(t) = \left\lfloor \frac{t + R_k - (MD_k + \gamma_{j,k,y}) \cdot d_{main}}{T_k} \right\rfloor \tag{24} \]

Note that the term $(MD_k + \gamma_{j,k,y}) \cdot d_{main}$ represents the time for the memory accesses. Hence the total number of accesses possible in an interval of length $t$ due to task $\tau_k$ and its cache related pre-emption effects is given by:

\[ W_{j,k}^y(t) = N_{j,k}^y(t) \cdot (MD_k + \gamma_{j,k,y}) + \min\left( MD_k + \gamma_{j,k,y},\ \left\lceil \frac{t + R_k - (MD_k + \gamma_{j,k,y}) \cdot d_{main} - N_{j,k}^y(t) \cdot T_k}{d_{main}} \right\rceil \right) \tag{25} \]

Hence we have:

\[ A_j^y(t) = \sum_{k \in \Gamma_y \,\wedge\, k \in hep(j)} W_{j,k}^y(t) \tag{26} \]

The value of $L_j^y(t)$ is obtained in a similar way to $A_j^y(t)$, but considering accesses with lower priority than $j$:

\[ L_j^y(t) = \sum_{k \in \Gamma_y \,\wedge\, k \in lp(j)} W_{n,k}^y(t) \tag{27} \]

We note that the carry-in interference has not been accounted for in Equations (5) and (6) of [24], resulting in potentially optimistic bounds on the number of competing memory requests in [24].
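The carry-in bound of Equations (24) to (26) can be sketched in a few lines of Python. The sketch assumes integer time units (cycles), takes the per-job demand $MD_k + \gamma_{j,k,y}$ as a precomputed value, and uses hypothetical field names (`prio`, `R`, `T`, `demand`), with a lower `prio` value meaning a higher priority.

```python
def carry_in_accesses(t, r_k, t_k, demand, d_main):
    """W_{j,k}^y(t) of Eq. (25) for one task on a remote core.

    demand stands for MD_k + gamma_{j,k,y}; all times are integer cycles.
    """
    # Interval extended by the carry-in job, minus the time its accesses take.
    window = t + r_k - demand * d_main
    n_jobs = window // t_k                 # Eq. (24): complete jobs
    leftover = window - n_jobs * t_k       # time left for a partial job
    tail = -(-leftover // d_main)          # integer ceiling of leftover / d_main
    return n_jobs * demand + min(demand, tail)


def accesses_hep(j, t, remote_tasks, d_main):
    """A_j^y(t) of Eq. (26): sum over remote tasks with priority j or higher."""
    return sum(
        carry_in_accesses(t, task["R"], task["T"], task["demand"], d_main)
        for task in remote_tasks
        if task["prio"] <= j
    )


# Hypothetical remote core with two tasks, used only for illustration.
REMOTE_CORE = [
    {"prio": 0, "R": 30, "T": 50, "demand": 4},
    {"prio": 2, "R": 40, "T": 100, "demand": 6},
]
```

$L_j^y(t)$ from Equation (27) would follow the same pattern with the filter reversed to `task["prio"] > j` and the demands computed with $\gamma_{n,k,y}$.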
The numbers of accesses on the cores are used as input to the BUS function (see Section 5), which we use to derive the maximum bus delay that task $\tau_i$ on processor $P_x$ can experience during a time interval of length $t$,

\[ I^{BUS}(i, x, t) = BUS(i, x, t) \cdot d_{main} \tag{28} \]

where $d_{main}$ is the bus access latency to the global memory.

6.4 Global Memory

So far we have assumed a global memory with a constant access latency $d_{main}$. Global memory is usually realized based on dynamic random-access memory (DRAM), which needs to be refreshed periodically. Now, we show how to relax the constant-latency assumption to take into account delays imposed by refreshes. We assume a DRAM controller with a First Come First Served (FCFS) scheduling policy so that memory accesses cannot be reordered within the controller. Further, we assume a closed-page policy to minimize the effect of the memory access history on access latencies.

We consider two refresh strategies [31]: distributed refresh, where the controller refreshes each row at a different time, at regular intervals, and burst refresh, where all rows are refreshed immediately one after another. Under burst refresh, an upper bound on the maximum number of refreshes within an interval of length $t$ in which $m$ memory accesses occur is given by:

\[ DRAM^{burst}(t, m) = \left\lceil \frac{t}{T_{refresh}} \right\rceil \cdot \#rows \tag{29} \]

where $\#rows$ is the number of rows in the DRAM module, and $T_{refresh}$ is the interval at which each row needs to be refreshed. $T_{refresh}$ is usually 64 ms for DDR2 and DDR3 modules. Under distributed refresh, the upper bound is:

\[ DRAM^{dist}(t, m) = \min\left( m,\ \left\lceil \frac{t \cdot \#rows}{T_{refresh}} \right\rceil \right) \tag{30} \]

This is the case since at most one memory access can be delayed by each of the refreshes, whereas under burst refresh a single memory access can be delayed by $\#rows$ many refreshes. As the number of memory accesses within $t$ is equal to the number of bus accesses, we can bound the interference due to DRAM refreshes of task $\tau_i$ on core $P_x$ as follows:

\[ I^{DRAM}(i, x, t) = DRAM(t, BUS(i, x, t)) \cdot d_{refresh} \tag{31} \]

where $d_{refresh}$ is the refresh latency.
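The two refresh bounds and Equation (31) can be sketched as follows. The sketch assumes that $t$ and $T_{refresh}$ are expressed in the same integer unit (cycles), and the row count used in the example (8192) is a hypothetical value chosen only for illustration.

```python
def dram_burst(t, m, rows, t_refresh):
    """Eq. (29): every refresh period within t refreshes all rows back to back.

    The access count m is unused, but kept so both bounds share a signature.
    """
    return -(-t // t_refresh) * rows  # integer ceiling of t / t_refresh


def dram_dist(t, m, rows, t_refresh):
    """Eq. (30): each refresh delays at most one of the m accesses."""
    return min(m, -(-(t * rows) // t_refresh))


def dram_interference(t, bus_accesses, d_refresh, rows, t_refresh,
                      bound=dram_dist):
    """Eq. (31): number of refreshes times the refresh latency."""
    return bound(t, bus_accesses, rows, t_refresh) * d_refresh


# 64 ms at 200 MHz is 12,800,000 cycles; the row count is an assumed value.
T_REFRESH_CYCLES = 12_800_000
ASSUMED_ROWS = 8192
```

For an interval of 1,000,000 cycles, distributed refresh yields at most 640 refreshes, capped by the number of accesses in the interval, while burst refresh charges all 8192 rows at once.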

6.5 Multicore Response Time Analysis

The response time $R_i$ of task $\tau_i$ is given by the smallest solution to the following recurrence relation:

\[ R_i = PD_i + I^{PROC}(i, x, R_i) + I^{BUS}(i, x, R_i) + I^{DRAM}(i, x, R_i) \tag{32} \]

where $I^{PROC}(i, x, R_i)$ is the interference due to processor demand from higher priority tasks running on the same processor assuming no misses on the local memory (see Equation (18)), $I^{BUS}(i, x, R_i)$ is the delay due to bus accesses from tasks running on all cores including $MD_i$ (see Equation (28)), and $I^{DRAM}(i, x, R_i)$ is the delay due to DRAM refreshes (see Equation (31)).

Since the response time of each task can depend on the response times of other tasks via the functions (26) and (27) describing memory accesses $A_j^y(t)$ and $L_j^y(t)$, we use an outer loop around a set of fixed-point iterations to compute the response times of all the tasks, and so deal with an apparent circular dependency. Iteration starts with $\forall i: R_i = PD_i + MD_i \cdot d_{main}$ and ends when all the response times have converged (i.e. no response time changes w.r.t. the previous iteration), or the response time of a task exceeds its deadline, in which case that task is unschedulable. See Algorithm 1 for pseudo-code of the response time calculation.

Algorithm 1 Response Time Computation
function MultiRTA
    $\forall i: R_i^0 = 0$
    $\forall i: R_i^1 = PD_i + MD_i \cdot d_{main}$
    $l = 1$
    while $\exists i: R_i^l \neq R_i^{l-1} \ \wedge\ \forall i: R_i^l \leq D_i$ do
        for all $i$ do
            $R_i^{l,0} = R_i^{l-1}$
            $R_i^{l,1} = R_i^l$
            $k = 1$
            while $R_i^{l,k} \neq R_i^{l,k-1} \ \wedge\ R_i^{l,k} \leq D_i$ do
                $R_i^{l,k+1} = PD_i + I^{PROC}(i, x, R_i^{l,k}) + I^{BUS}(i, x, R_i^{l,k}) + I^{DRAM}(i, x, R_i^{l,k})$
                $k = k + 1$
            end while
        end for
        $\forall i: R_i^{l+1} = R_i^{l,k}$
        $l = l + 1$
    end while
    if $\forall i: R_i^l \leq D_i$ then return schedulable
    else return not schedulable
    end if
end function

Since the response time $R_i$ of a task $\tau_i$ is monotonically increasing w.r.t. increases in the response time of any other task, convergence or exceeding a deadline is guaranteed in a bounded number of iterations.
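The nested iteration can be sketched in Python as below. This is a sketch under stated assumptions rather than the authors' implementation: the three interference terms of Equation (32) are supplied as one hypothetical callable `interference(i, t, R)`, where `R` is the snapshot of all response-time estimates from the previous outer round, and each task is re-evaluated at least once per outer round, which simplifies the loop guard of Algorithm 1. The toy interference function and the three-task example are invented purely to exercise the loops.

```python
from math import ceil


def multi_rta(tasks, interference, d_main):
    """Outer loop over all tasks around per-task fixed-point iterations."""
    n = len(tasks)
    prev = [0] * n
    curr = [tk["PD"] + tk["MD"] * d_main for tk in tasks]
    while curr != prev and all(curr[i] <= tasks[i]["D"] for i in range(n)):
        nxt = []
        for i, tk in enumerate(tasks):
            r = curr[i]
            while r <= tk["D"]:            # stop once the deadline is exceeded
                r_new = tk["PD"] + interference(i, r, curr)
                if r_new == r:             # inner fixed point reached
                    break
                r = r_new
            nxt.append(r)
        prev, curr = curr, nxt
    return all(curr[i] <= tasks[i]["D"] for i in range(n)), curr


# Toy system (assumed values): index = global priority, 0 is the highest.
D_MAIN = 1
TASKS = [
    {"core": 0, "PD": 2, "MD": 1, "T": 10, "D": 10},
    {"core": 0, "PD": 3, "MD": 1, "T": 20, "D": 20},
    {"core": 1, "PD": 1, "MD": 2, "T": 20, "D": 20},
]


def toy_interference(i, t, R):
    """Crude stand-in for I_PROC + I_BUS + I_DRAM; not the paper's BUS function."""
    me = TASKS[i]
    total = me["MD"] * D_MAIN              # the task's own memory accesses
    for k, other in enumerate(TASKS):
        if k == i:
            continue
        if other["core"] == me["core"]:
            if k < i:                      # same-core higher-priority demand
                total += ceil(t / other["T"]) * (other["PD"] + other["MD"] * D_MAIN)
        else:                              # remote accesses with a carry-in window
            total += ceil((t + R[k]) / other["T"]) * other["MD"] * D_MAIN
    return total


RESULT = multi_rta(TASKS, toy_interference, D_MAIN)
```

Passing the snapshot `R` into the interference callable is what allows remote-core terms such as $A_j^y(t)$ to depend on the response times of other tasks, which is the reason the outer loop is needed.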
We note that the analysis is sustainable [8] with respect to the processor demand $PD_j$ and memory demand $MD_j$ of each task, since values that are smaller than the upper bounds used in the analysis cannot result in a larger response time. This sustainability extends to traces; if any trace of task execution results in practice in a lower processor or memory demand than that considered by the analysis, then this also cannot result in an increase in the response time. Similarly, a decrease in the set of UCBs or ECBs such that they are a subset of those considered by the analysis cannot increase the worst-case response time.

Note that the definitions of MD, PD and ECB completely decouple the traces from the response time analysis. This comes at the cost of possible pessimism, but strongly reduces the complexity of the analysis. Different traces may maximize different parameters, meaning that the combination of the parameters in this way may represent a synthetic worst-case that cannot occur in practice. An alternative solution is to define a multicore response time analysis that is parametric in the execution traces. In the extreme, completely expanding the analysis to explore every combination of traces from different tasks would be intractable. However, as a first step in this direction, response times could be computed for each individual trace of the task of interest $\tau_i$, using combined traces for all other tasks. The maximum such response time would then provide an improved upper bound.

6.6 Extensions

Above, we instantiated the Multicore Response Time Analysis (MRTA) framework for relatively simple task and multicore architectural models. In the technical report [2], we briefly discuss extensions including: RTOS and interrupts, dynamic scratchpad management, sharing software resources, open systems and incremental verification, write-back cache policies and multi-level caches. The presented analysis framework is not fine-tuned to specific hardware features or execution scenarios such as burst accesses, since this would counteract its extensibility and generality.

7. EXPERIMENTAL EVALUATION

In this section we describe the results of an experimental evaluation using the MRTA framework (the software is available on demand). For the evaluation, we use the Mälardalen benchmark suite [18] to provide traces. We model a multicore system based on an ARM Cortex A5 multicore as a reference architecture to provide a cache configuration and memory and bus latencies. As this work is intended to provide an overview of our generic and extensible framework, we do not model all details of the specific multicore architecture. A case study comparing measurements on real hardware with the computed bounds is future work.

[Figure 3: Multicore Architecture Case Study: m = 4 cores with local caches connected via a common bus to a global memory. Each ARMv7 core has an ICache and a DCache; the bus leads to the IO/global memory.]

The reference architecture depicted in Figure 3 is configured as follows: it has 4 ARMv7 cores connected to the global memory/IO over a shared bus, assuming a Round-Robin arbitration policy and a core frequency of 200 MHz. Each core has separate instruction and data caches, with 256 cache sets each and a block size of 32 bytes. The global memory latency $d_{main}$ and the DRAM refresh latency $d_{refresh}$ are both 5 cycles. The DRAM refresh period $T_{refresh}$ is 64 ms. We assume the DRAM implements the distributed refresh strategy (see Section 6.4). We examine derivatives of the reference configuration assuming the different bus arbitration policies presented in Section 5 and a hypothetical perfect bus which eliminates all bus interference if the bus utilization is $\leq 1$.

We compare the reference configuration with two alternative architectures. The first, referred to as the full-isolation architecture, implements complete spatial and temporal isolation. The local caches are partitioned with an equal partition size for each task and the bus uses a TDMA arbitration policy. All other parameters remain the same as in the reference architecture. The performance on the full-isolation architecture corresponds to the traditional two-step approach to timing verification with context-independent WCETs. The second alternative, referred to as the uncached architecture, assumes no local caches except for a buffer of size 1, and uses Round-Robin bus arbitration. All other parameters are again the same as in the reference configuration.

The traces for the benchmarks were generated using the gem5 instruction set simulator [13] and contain statically linked library calls. As the benchmark code corresponds to independent tasks, no data is shared between the tasks. Table 1 shows information for a representative selection of the 39 benchmark programs used to provide traces, including the total number of instructions (which is equal to the processor demand), the number of read/write operations, the memory demand, and the maximum number of UCBs and ECBs on the reference multicore architecture. Complete information for all benchmarks can be found in Table 1 of the technical report [2]. Each benchmark is assigned only one trace, which is sufficient due to the simple structure of the benchmark suite: the benchmarks are either single-path or worst-case input is provided. Despite the rather simple structure of the benchmarks, the tasks show a strong variation in processor and memory demand. As all benchmarks exhibit only one trace, the worst-case processor and memory demand coincide. Evaluation of more complex tasks, including evaluation of the trade-off between the pessimism of independent upper bounds and the computational complexity of explicit traces, remains as future work.

We identify three main sources of over-approximation in our multicore response time analysis framework:
The number of memory accesses on the same core cannot be precisely estimated due to imprecision in the pre-emption cost analysis.
The interference due to bus accesses may be pessimistic, as not all tasks running on another core can simultaneously access the bus.
The DRAM refreshes are assumed too frequently if the number of main memory accesses is over-approximated.

A sophisticated evaluation of the precision of our analysis requires measurements on a real architecture, which we cannot yet provide. However, the different architecture configurations provide an estimate of the influence of the different sources of pessimism. The reference architecture with a perfect bus eliminates any pessimism due to bus interference and DRAM accesses. Only the pessimism of the pre-emption cost analysis remains, which has been quantified in [3]. The full-isolation architecture removes all pessimism due to the bus interference and the pre-emption costs, and thus only suffers from the pessimism in the DRAM analysis.

We evaluated the guaranteed performance of the various configurations as computed using the MRTA framework on a large number of randomly generated task sets. The task set parameters were as follows: The default task set size was 32, with 8 tasks per core. Each task was randomly assigned a trace from Table 1. The base WCET per task $\tau_i$, needed solely to set the task periods and deadlines, was defined as

\[ C_i = PD_i + MD_i \cdot d_{main} + DRAM(PD_i + MD_i \cdot d_{main},\ MD_i) \cdot d_{refresh} \]

$C_i$ denotes the execution time of the task without any interference from any other task. The task utilizations were generated using UUnifast [12] with an equal utilization assumed for each core. Task periods were set based on task utilization and base WCET, i.e., $T_i = C_i / U_i$. Task deadlines were implicit. Priorities were assigned in deadline monotonic order.

[Table 1: Benchmark traces. Columns: Name, # Instr. (PD), Read/Write, MD, UCB, ECB. Benchmarks listed: adpcm_enc, bsort, compress, fdct, lms, nsichneu, petrinet, statemate.]

We note that on a multicore system the processor utilization is often not the limiting factor, but rather the memory utilization, defined as:

\[ U^{BUS} = \sum_i \frac{MD_i \cdot d_{main}}{T_i} \tag{33} \]

Only if $U^{BUS} \leq 1$ can the tasks be scheduled. The utilization per core was varied from […] to […] in steps of […]. For each utilization value, 1000 task sets were generated and the schedulability was determined for each architectural configuration.
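The task set generation described above can be sketched as follows. The `uunifast` function follows the published UUniFast procedure of [12]; everything else is an assumption made for illustration: the function and parameter names are hypothetical, times are integer cycles, and the distributed refresh bound of Equation (30) is used for the DRAM term with an assumed row count.

```python
import random


def uunifast(n, total_u, rng):
    """UUniFast [12]: n utilizations that sum to total_u."""
    utils = []
    remaining = total_u
    for i in range(1, n):
        nxt = remaining * rng.random() ** (1.0 / (n - i))
        utils.append(remaining - nxt)
        remaining = nxt
    utils.append(remaining)
    return utils


def base_wcet(pd, md, d_main, d_refresh, rows, t_refresh):
    """Base WCET C_i: isolated execution time including DRAM refreshes."""
    c0 = pd + md * d_main
    refreshes = min(md, -(-(c0 * rows) // t_refresh))  # Eq. (30), integer ceiling
    return c0 + refreshes * d_refresh


def periods(wcets, utils):
    """T_i = C_i / U_i for each task."""
    return [c / u for c, u in zip(wcets, utils)]


def bus_utilisation(mds, task_periods, d_main):
    """Eq. (33): fraction of time the bus is busy with task memory demand."""
    return sum(md * d_main / t for md, t in zip(mds, task_periods))
```

A per-core task set would then be obtained by calling `uunifast(8, core_utilisation, random.Random(seed))`, computing `base_wcet` for each assigned trace, and deriving the periods; task sets whose `bus_utilisation` exceeds 1 cannot be schedulable under any of the bus policies.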
Figure 4 shows the number of schedulable task sets plotted against the core utilization (computed using the base WCETs) and Figure 5 against the bus utilization $U^{BUS}$.

[Figure 4: Number of schedulable task sets vs. core utilization. Series: reference config with perfect bus, FP bus, RR bus, TDMA bus, PP bus and FIFO bus; full-isolation architecture; uncached architecture.]

[Figure 5: Number of schedulable task sets vs. bus utilization. Series as in Figure 4.]

Most traces from Table 1 have a high memory demand, which results in a high number of bus accesses even at low core utilizations. Consequently, most task sets are not schedulable even with a perfect bus. The Fixed-Priority bus (green line), where the memory accesses inherit the task priority, shows the best performance, followed by Round-Robin (dark blue line) and then TDMA (pink line). The full-isolation architecture (light blue line), implementing TDMA and cache partitioning on the local caches, performs nearly as well as the TDMA architecture, which indicates that the increased execution times due to cache partitioning only have a minor impact in this case. Note that for TDMA and Round-Robin, we assume a cycle with 2 slots per processor. The FIFO bus (black line) shows the lowest performance, similar to that of the uncached architecture, which uses Round-Robin. The worst-case arrival pattern for a FIFO bus assumes that each potentially co-running task has issued bus requests just before the release of the task of interest, which results in very pessimistic bus contention and response times. The analysis for the Processor-Priority bus (dark green line) only assumes that co-running tasks assigned to a processor of higher priority have issued requests, which explains the improved performance compared to the FIFO bus. We note that the task set generation does not optimize the task assignment with respect to the Processor-Priority bus. Such an optimization could greatly improve the relative performance of this policy by assigning tasks with shorter deadlines to a processor with higher priority. The difference between Fixed-Priority and Round-Robin/TDMA shows that the MRTA framework is able to guarantee good performance even if the bus policy does not provide a tightly bounded bus latency for single accesses (as is the case for TDMA and Round-Robin).

Figures 4 and 5 only show the results for different bus policies and three cache configurations (uncached, partitioned and unconstrained cache usage).

[Figure 6: Weighted schedulability; varying bus latency (in cycles). Axes: bus latency vs. weighted measure (core utilization). Series as in Figure 4.]

[Figure 7: Weighted schedulability; varying number of cores. Axes: number of cores vs. weighted measure (core utilization). Series as in Figure 4.]

[Figure 8: Weighted schedulability; varying DRAM refresh latency (in cycles). Axes: refresh latency vs. weighted measure (core utilization). Series as in Figure 4.]
In the following, we examine how other parameters, including the main memory latency, the number of cores, and the DRAM refresh latency, impact schedulability. We use the weighted schedulability measure [9] to show how schedulability varies with these parameters.

As the memory demand of the benchmark traces is high, the bus latency $d_{main}$ has a tremendous impact on overall schedulability (see Figure 6). The bus latency affects all bus policies similarly.

By increasing the number of cores, the number of tasks also increases (assuming a fixed number of tasks per core) and so does the bus utilization. The performance of all configurations decreases (see Figure 7) as fewer task sets are deemed schedulable, irrespective of the bus policy.

As might be expected, longer DRAM refresh latencies have a significant detrimental effect on schedulability for all configurations, see Figure 8.

8. CONCLUSIONS

In this paper, we introduced a Multicore Response Time Analysis (MRTA) framework. This framework is extensible to different multicore architectures, with various types and arrangements of local memory, and different arbitration policies for the common interconnects. In this initial paper, we instantiated the MRTA framework assuming single level local data and instruction memories (cache or scratchpads), and for a variety of memory bus arbitration policies, including: Round-Robin, FIFO, Fixed-Priority, Processor-Priority, and TDMA. The MRTA framework provides a general approach to timing verification for multicore systems that is parametric in the hardware configuration (common interconnect, local memories, number of cores etc.) and so can be used both at the architectural design stage to compare the guaranteed levels of performance obtained with different hardware configurations, and also during development to verify the timing behaviour of a specific system.

The MRTA framework decouples response time analysis from a reliance on context-independent WCET values. Instead, the analysis formulates response times directly from the demands on different hardware resources.
Such a separation of concerns trades different sources of pessimism. The simplifications used to make the analysis tractable are unable to take advantage of overlaps between processing and memory demands; however, this compromise is set against substantial gains acquired by considering the worst-case behaviour of resources, such as the memory bus, over long durations equating to task response times, rather than summing the worst case over short durations such as a single access, as is the case with the traditional two-step approach using context-independent WCETs.

While the initial instantiation of the MRTA framework given in this paper cannot capture every source of interference or delay exhibited in actual multicore processors, it captures the most significant effects. Importantly, the framework can be: (i) extended to incorporate effects due to other hardware resources, and different scheduling / resource access policies, (ii) refined to provide tighter analysis for those elements instantiated in this paper, (iii) tailored to better model the implementation of actual multicore processors.

Our evaluation used the MRTA framework to model and analyse a generic multicore processor based on information about the ARM Cortex A5, with software from the Mälardalen benchmark suite used as code for the tasks in our case study. Our results show that while a full-isolation architecture may be preferable with the traditional two-step approach to timing verification, the MRTA framework can leverage the substantial performance improvements that can be obtained by using dynamic policies such as Fixed-Priority bus arbitration based on task priorities.

The technical report [2] discusses a variety of ways in which the framework can be extended. In future we aim to explore these avenues, extending our work by instantiating the analysis for more complex behaviours and architectures, as well as to global and semi-partitioned scheduling policies. We also plan to run detailed (cycle accurate) simulations of the multicore architectures to examine the effectiveness of the MRTA framework compared to observed behaviour.

Acknowledgements

This work was supported in part by the COST Action IC1202 TACLe, by the DFG as part of the Transregional Collaborative Research Centre SFB/TR 14 (AVACS), by National Funds through FCT/MEC (Portuguese Foundation for Science and Technology) and co-financed by ERDF (European Regional Development Fund) under the PT2020 Partnership, within project UID/CEC/04234/2013 (CISTER Research Centre), by FCT/MEC and the EU ARTEMIS JU within project ARTEMIS/0001/ JU grant nr (EMC2), by the INRIA International Chair program, and by the EPSRC project MCC (EP/K011626/1). EPSRC Research Data Management: No new primary data was created during this study. This collaboration was partly due to the Dagstuhl Seminar on Mixed Criticality

References

[1] S. Altmeyer. Analysis of Preemptively Scheduled Hard Real-time Systems. epubli GmbH,
[2] S. Altmeyer, R. I. Davis, L. Indrusiak, C. Maiza, V. Nelis, and J. Reineke. A generic and compositional framework for multicore response time analysis. Technical report, Dept. Computer Science, University of York, UK, ftpdr/reports/2015/ycs/499/ycs pdf.
[3] S. Altmeyer, R. I. Davis, and C. Maiza. Cache related pre-emption aware response time analysis for fixed priority pre-emptive systems. In RTSS, pages , December
[4] S. Altmeyer, R. I. Davis, and C. Maiza. Improved cache related preemption delay aware response time analysis for fixed priority preemptive systems. Real-Time Systems, 48(5): ,
[5] S. Altmeyer, R. Douma, W. Lunniss, and R. I. Davis. Evaluation of cache partitioning for hard real-time systems. In ECRTS, pages 15 26, July
[6] P. Atanassov and P. Puschner. Impact of DRAM refresh on the execution time of real-time tasks. In IEEE International Workshop on Application of Reliable Computing and Communication, pages 29 34, December
[7] N. Audsley, A. Burns, M. Richardson, K. Tindell, and A. J. Wellings.
Applying new scheduling theory to static priority preemptive scheduling. Software Engineering Journal, 8: ,
[8] S. Baruah and A. Burns. Sustainable scheduling analysis. In RTSS, pages , December
[9] A. Bastoni, B. Brandenburg, and J. Anderson. Cache-related preemption and migration delays: Empirical approximation and impact on schedulability. In OSPERT, pages 33 44, July
[10] M. Bertogna and M. Cirinei. Response-time analysis for globally scheduled symmetric multiprocessor platforms. In RTSS, pages ,
[11] B. Bhat and F. Mueller. Making DRAM refresh predictable. Real-Time Systems, 47(5): , September
[12] E. Bini and G. Buttazzo. Measuring the performance of schedulability tests. Real-Time Systems, 30: ,
[13] N. Binkert et al. The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1 7, August
[14] D. Bui, E. Lee, I. Liu, H. Patel, and J. Reineke. Temporal isolation on multiprocessing architectures. In DAC, pages , June
[15] S. Chattopadhyay, A. Roychoudhury, and T. Mitra. Modeling shared cache and bus in multi-cores for timing analysis. In SCOPES, pages 6:1 6:10, June
[16] D. Dasari, V. Nelis, and B. Akesson. A framework for memory contention analysis in multi-core platforms. Real-Time Systems, pages 1 51,
[17] C. Ferdinand, F. Martin, R. Wilhelm, and M. Alt. Cache behavior prediction by abstract interpretation. Science of Computer Programming, 35(2-3): ,
[18] J. Gustafsson, A. Betts, A. Ermedahl, and B. Lisper. The Mälardalen WCET benchmarks: past, present and future. In WCET, pages , July
[19] A. Gustavsson, A. Ermedahl, B. Lisper, and P. Pettersson. Towards WCET analysis of multicore architectures using UPPAAL. In WCET, pages , Dagstuhl, Germany, July
[20] S. Hahn, J. Reineke, and R. Wilhelm. Towards compositionality in execution time analysis: definition and challenges. In CRTS, December
[21] M. Joseph and P. Pandya. Finding Response Times in a Real-Time System. The Computer Journal, 29(5): , May
[22] T. Kelter, H. Falk, P. Marwedel, S. Chattopadhyay, and A. Roychoudhury. Bus-aware multicore WCET analysis through TDMA offset bounds. In ECRTS, pages 3 12, July
[23] T. Kelter, H. Falk, P. Marwedel, S. Chattopadhyay, and A. Roychoudhury. Static analysis of multi-core TDMA resource arbitration delays. Real-Time Systems Journal, 50(2): ,
[24] H. Kim, D. de Niz, B. Andersson, M. Klein, O. Mutlu, and R. Rajkumar. Bounding memory interference delay in COTS-based multi-core systems. In RTAS,
[25] C.-G. Lee, J. Hahn, Y.-M. Seo, S. L. Min, R. Ha, S. Hong, C. Y. Park, M. Lee, and C. S. Kim. Analysis of cache-related preemption delay in fixed-priority preemptive scheduling. IEEE Transactions on Computers, 47(6): ,
[26] C.-G. Lee, K. Lee, J. Hahn, Y.-M. Seo, S. L. Min, R. Ha, S. Hong, C. Y. Park, M. Lee, and C. S. Kim. Bounding cache-related preemption delay for real-time systems. IEEE TSE, 27(9): ,
[27] Y. Li, V. Suhendra, Y. Liang, T. Mitra, and A. Roychoudhury. Timing analysis of concurrent programs running on shared cache multi-cores. In RTSS, pages 57 67, December
[28] Yau-Tsun S. Li and S. Malik. Performance analysis of embedded software using implicit path enumeration. In DAC, pages , June
[29] I. Liu, J. Reineke, D. Broman, M. Zimmer, and E. A. Lee. A PRET microarchitecture implementation with repeatable timing and competitive performance. In ICCD, September
[30] M. Lv, W. Yi, N. Guan, and G. Yu. Combining abstract interpretation with model checking for timing analysis of multicore software. In RTSS, pages , December
[31] Micron Technologies, Inc. Various methods of DRAM refresh. Technical report,
[32] J. Nowotsch, M. Paulitsch, D. Buhler, H. Theiling, S. Wegener, and M. Schmidt. Multi-core interference-sensitive WCET analysis leveraging runtime resource capacity enforcement. In ECRTS, pages , July
[33] M. Paolieri, E. Quiñones, F. J. Cazorla, G. Bernat, and M. Valero. Hardware support for WCET analysis of hard real-time multicore systems. SIGARCH Comput. Archit. News, 37(3):57 68, June
[34] R. Pellizzoni, A. Schranzhofer, J.-J. Chen, M. Caccamo, and L. Thiele. Worst case delay analysis for memory interference in multicore systems. In DATE, pages , March
[35] P. Radojković, S. Girbal, A. Grasset, E. Quiñones, S. Yehia, and F. J. Cazorla. On the evaluation of the impact of shared resources in multithreaded COTS processors in time-critical environments. ACM TACO, 8(4):34,
[36] J. Reineke and J. Doerfert. Architecture-parametric timing analysis. In RTAS, pages , April
[37] J. Rosen, A. Andrei, P. Eles, and Z. Peng. Bus access optimization for predictable implementation of real-time applications on multiprocessor systems-on-chip. In RTSS, pages 49 60, Dec
[38] S. Schliecker, M. Negrean, and R. Ernst. Bounding the shared resource load for the performance analysis of multiprocessor systems. In DAC, pages , June
[39] A. Schranzhofer, J.-J. Chen, and L. Thiele. Timing analysis for TDMA arbitration in resource sharing systems. In RTAS, pages , April
[40] A. Schranzhofer, R. Pellizzoni, J.-J. Chen, L. Thiele, and M. Caccamo. Timing analysis for resource access interference on adaptive resource arbiters. In RTAS, pages , April
[41] J. Yan and W. Zhang. WCET analysis for multi-core processors with shared L2 instruction caches. In RTAS, pages 80 89,
[42] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha. Memory access control in multiprocessor for real-time systems with mixed criticality. In ECRTS, pages , 2012.