An ILP Formulation for Task Mapping and Scheduling on Multi-core Architectures


Ying Yi, Wei Han, Xin Zhao, Ahmet T. Erdogan and Tughrul Arslan
University of Edinburgh, The King's Buildings, Mayfield Road, Edinburgh, EH9 3JL, UK

Abstract-Multi-core architectures are increasingly being adopted in the design of emerging complex embedded systems. Key issues in designing such systems are on-chip interconnects, memory architecture, and task mapping and scheduling. This paper presents an integer linear programming formulation for the task mapping and scheduling problem. The technique incorporates profiling-driven loop level task partitioning, task transformations, functional pipelining, and memory architecture aware data mapping to reduce system execution time. Experiments are conducted to evaluate the technique by implementing a series of DSP applications on several multi-core architectures based on dynamically reconfigurable processor cores. The results demonstrate that the proposed technique is able to generate high-quality mappings of realistic applications on the target multi-core architecture, achieving a parallel efficiency of up to 1.3 by employing only two dynamically reconfigurable processor cores.

I. INTRODUCTION

An important trend in embedded systems is the use of multi-core architectures to meet an application's functional and performance requirements. Multi-core designs offer high performance and flexibility, and at the same time promise low-cost and power-efficient implementations. However, the semiconductor industry is still facing several other technological challenges with multi-core systems. Important issues in multi-core designs are the communication infrastructure, memory architecture, and task mapping and scheduling. In multi-core architectures, the performance of the entire system is affected by the execution order of tasks and communications. It is well known that task mapping and task scheduling are highly inter-dependent. Therefore the two issues need to be handled together in order to obtain an efficient mapping and schedule.
A dynamically reconfigurable (DR) processor combines the flexibility of FPGAs with the programmability found in general purpose processors (CPUs/DSPs) in a unified and easy programming environment. It is a strong candidate for multi-core systems. In our proposed embedded multi-core platform, which has several DR processors [1], the shared memory heavily affects the execution time and power consumption. The time of data transmission between different processors must be considered during scheduling so that the design result conforms to the real situation. In addition, in order to meet the system throughput constraints, the design is pipelined to construct more efficient architectures. Pipelining divides the design into concurrently executing stages, thus increasing the throughput.

In multi-core architectures all parallel tasks in an application have the potential to be executed simultaneously. However, the number of such tasks may exceed the number of available processors. Therefore task mapping is required to assign the parallel tasks to the available processors. In the past, task merging and task replication have been proposed with the goal of re-allocating tasks when performance bottlenecks are met. Since task merging requires more local memory and task replication needs more processors to implement the same task [2], a multi-core architecture which does not feature sufficient memory and processors will severely limit the available mapping options using the existing methodology.

Application development on multi-core architectures requires the designer, or an automated tool, to divide tasks between available processors and to determine data mappings for the required memory elements. A SystemC-based simulation framework for mapping an application to a platform and evaluating its performance has been presented in [3]. The authors in [4, 5] have introduced scheduling and mapping of parallel applications onto an MPSoC platform. Mapping solutions for bus-based and NoC-based MPSoCs have been described in [6] and [7].
Some automated system-level mapping techniques for application development on network processors have also been proposed [8].

This paper addresses the problem of automated application mapping and scheduling on DR processor based multi-core architectures. An Integer Linear Program (ILP) based approach is proposed for loop level task partitioning, task mapping and pipelined scheduling while taking the communication time into account for embedded applications. The efficacy of the technique is demonstrated by a series of DSP applications.

The paper is organized as follows: Section 2 introduces the target DR processor as well as the target multi-core architecture. Section 3 describes the task mapping methodology. Section 4 gives a more detailed description of the problem addressed in this paper. Section 5 describes the proposed ILP based approach to solve the problem. The experimental results are given in Section 6, followed by conclusions in Section 7.

II. TARGET MULTI-CORE ARCHITECTURE

Some applications demand a closer interconnection between the participating processors to achieve the required performance. Such communication can be realised using distributed shared register files. The target multi-core platform is designed for DSP applications, which typically have intensive computations and a stream of input data. The architecture described in a previous work [2] consists of a selectable number of DR processors, which communicate with a shared memory through a full crossbar network. This architecture has been extended and modified by incorporating the shared register file into the system memory architecture in order to support the loop level parallelism proposed in this paper.

/DATE EDAA

The target multi-core architecture is based on a recently introduced DR processor architecture [9]. The DR processor offers comparable computation performance to leading DSP processors with a significant reduction in power consumption [1]. It contains an array of instruction-set functional units connected by a programmable interconnection. The DR processor is realised using an array of Instruction-Cells (ICs) that is reconfigured every processor cycle to map data-paths consisting of both dependent and independent instructions. A salient characteristic of the DR processor is that it is fully customisable at the design stage and can be set according to application requirements. All instruction cells in the DR processor can be connected to the entire shared memory or register file through interface cells. This scheme allows all possible connections between the instruction cells and the shared memory elements, but with a reduced multi-core interconnection complexity. The existing DR processor tool-flow gives full support by providing all the required files for the ILP model, which include a machine description file, a static profiling file, and a task graph (control data dependency graph).

III. PROPOSED MAPPING FLOW

Many DSP algorithms target streaming based applications and need to operate in real-time: the DSP application must read input data, process it, and write processed results out before the next input data is ready. A key concern in a DSP system is maintaining real-time execution. It is necessary to pipeline the application into concurrently executing stages to meet the throughput constraints of these systems. The implementation of DSP applications on a multi-core architecture mainly involves partitioning, mapping and scheduling the application tasks onto the processors as well as specifying data mapping and data transmission between these processors. The partitioning, mapping and scheduling of tasks are complex optimisation problems, which need to be solved simultaneously to maximise the throughput.
In addition, the data communication time between different processors must also be taken into account during task mapping and scheduling to maximise the system throughput.

The proposed mapping flow, shown in Fig. 1, allows the designer to explore the application implementations on the target architecture platform. This paper mainly focuses on the automatic mapping. An ILP-based approach is proposed for loop level task partitioning, task mapping, and pipelined scheduling, while taking the data communication time into account for embedded applications targeted on the multi-core platform. During the process of task partitioning and task mapping, an estimate of the execution time for the different tasks as well as for transferring the data between processors is required. It is also necessary to schedule the execution order of these pipelined tasks to improve the system performance. The execution time of a task can be obtained from the profiling file generated by the single DR simulator.

Fig. 1: Mapping methodology

The mapping flow starts from the description of an application in standard sequential C code, which is then optimised and profiled for a single DR processor implementation. Application developers can use the generated task mapping and scheduling information and a task-level interface (TLI) to build multi-core application code. The TLI is an application programming interface that can be used for developing parallel application programs on multi-core architectures. It provides services for inter-task communication and task allocation. It must allow parallelism and communication to be made explicit to enable mapping to multi-core architectures. For example, if a task uses an abstract interface for synchronization with other tasks, the interface hides the detailed implementation of the synchronization. The multi-core application code is compiled and simulated with the single DR processor. The single DR processor simulator generates an execution trace file, which is used as an input to the multi-core simulator (MRPSIM) [9].
MRPSIM is a trace-driven simulator which can correctly and efficiently simulate the run-time multi-core environment, allowing the throughput of the modelled system to be measured. The proposed mapping approach also takes into account the task graph (control data flow graph), the multi-core architecture model, and the static profiling file. The static profiling information contains the timing characteristics for each task and the access frequency for the various data items. The multi-core architecture model, also called the machine description file, consists of the set of processors and the set of memories. These are used for mapping tasks to available processors as well as mapping the various data items to the memory architecture.

Our solution consists of dividing the problem into two stages and solving each consecutively. The first stage assigns and schedules tasks to processors, assuming an idealistic memory mapping, where all data items are mapped to the fastest possible level of memory, ignoring memory capacities. The ILP formulation of this stage includes task merging, task replication, and loop level splitting and fusion. Task merging combines several tasks into a single task that performs the tasks in an efficient order. This technique reduces the number of required processors, but needs more local instruction and data memories. Task replication assigns the same task to several processors such that all instances of the task are executed in parallel. Therefore, task replication needs more processors to implement an application and also more global memory to save the shared data. A new mapping approach is needed when the workload among the processors is unbalanced and task replication cannot be used due to the limited number of available processors. The new mapping approach divides the tasks at basic block level instead of at the function level in order to exploit the loop level parallelism.

In the task and communication scheduling process, in order to consider data dependencies between tasks and resolve resource contention, we model the scheduling problem with data dependency constraints between tasks and constraints that represent resource contention. The task graph generated by the DR compiler provides basic block and function level control and data dependencies. A task is defined as a small procedure, a function or just a basic block. A start task has no incoming arcs, and an end (leaf) task has no outgoing arcs.

An ILP model with and without functional pipelining is proposed in [10], but it can only handle pipelining in a restricted way. The restriction is that the first computation of a task in the (i+1)-th iteration is only possible if all leaf tasks are finished in the i-th iteration. This limitation will generate inefficient solutions for some task graphs, as illustrated in Fig. 2. Fig. 2(a) gives the task graph of a small example. The time interval between two successive iterations of the algorithm is called latency (LT). The overall computation time (OCT) of n frames without functional pipelining is equal to n*OET, where OET is the overall execution time for one frame. The method given in [10] yields a longer latency (LT = OET), which is shown in Fig. 2(b). Our model removes the above limitation, which results in the more efficient mapping and scheduling shown in Fig. 2(c). This approach will be illustrated in detail in the following two sections.

Fig. 2. Advantage of functional pipelining: (a) task graph; (b) schedule on processors Pm and Pk over time with OCT = OET + LT = 2*OET; (c) schedule on processors Pm and Pk over time with OCT = OET + LT < 2*OET.

The second stage performs a mapping of data items to the memory architecture and explores the memory architecture for minimizing memory access latencies. Each DR processor can access three types of memory: (a) the shared multi-bank register file, (b) local memory, and (c) shared memory. The local memories are private to a processor and cannot be accessed by other processors. Thus, shared data items, accessed by tasks assigned to different processors, cannot be mapped to these memories. Instead, they are mapped either to the shared multi-bank register file or to shared memory, which can be accessed by the different processors. Here, we adopted an ILP formulation similar to that given in [8] for this stage.

IV. MAPPING HEURISTICS AND PROBLEM DEFINITION

In this section, we define the problem of task mapping and scheduling for DR processor based multi-core architectures. Given a task graph, a multi-core target architecture with its parameters, and a mapping of tasks and data on the target architecture including processors and memory, the problem is to find a mapping and scheduling of task executions and communication transactions which yields the minimum execution time of the task graph on the target architecture. To solve the problem, the target architecture, task and application definitions are presented. Then, an ILP formulation or a heuristic algorithm is introduced to map and schedule tasks and communications.

A. Architecture Definition

A target multi-core architecture is specified by the set of processors P and the set of memory elements M. Each processor p is a 2-tuple p=(pid, pcm) where pid is the processor identifier and pcm is the instruction memory of the processor. Each memory element m is given by a 4-tuple m=(mid, mc, mt, nm), where mid is the memory element identifier, mc is the capacity of the memory, mt is the time required to access the memory, and nm is the type of the memory (shared memory, local memory, or shared register file).
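The processor and memory tuples defined above could be held in plain C structures, as in the minimal sketch below. The type names, the units (bytes, nanoseconds) and the two helper functions are assumptions made for illustration and are not taken from the machine description file format. The access-time helper uses a simple count-times-latency estimate that ignores access conflicts.

```c
#include <assert.h>
#include <stddef.h>

/* Memory type nm of the 4-tuple m = (mid, mc, mt, nm). */
typedef enum { MEM_SHARED, MEM_LOCAL, MEM_SHARED_REG } mem_type;

/* Processor 2-tuple p = (pid, pcm). */
typedef struct {
    int pid;      /* processor identifier */
    size_t pcm;   /* instruction memory size (assumed unit: bytes) */
} processor;

/* Memory element 4-tuple m = (mid, mc, mt, nm). */
typedef struct {
    int mid;      /* memory element identifier */
    size_t mc;    /* capacity (assumed unit: bytes) */
    double mt;    /* access time (assumed unit: ns) */
    mem_type nm;  /* shared memory, local memory or shared register file */
} memory_element;

/* Local memories are private to one processor, so data shared between
 * processors can only live in shared memory or the shared register file. */
int can_hold_shared_data(const memory_element *m)
{
    assert(m != NULL);
    return m->nm != MEM_LOCAL;
}

/* Hypothetical cost estimate: number of accesses times access time,
 * ignoring conflicts between processors. */
double total_access_time_ns(const memory_element *m, unsigned long accesses)
{
    assert(m != NULL);
    return m->mt * (double)accesses;
}
```

With a 2 ns register file and a 5 ns shared memory, 100 accesses cost 200 ns and 500 ns respectively under this estimate, which is why frequently accessed shared data is a natural candidate for the register file.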
The target architecture model based on the above specification is extracted from the multi-core architecture model file.

B. Application Definition

An application is represented by the hierarchical task graph, which is an acyclic directed graph G=<V, E>, where the vertex set V is a set of tasks and the edge set E is a set of communication edges. Each small procedure or basic block is defined as a task, which is a 5-tuple t=(tid, pt, tc, td, nd) where tid is the task identifier, pt is the task execution time excluding delays due to conflicting memory accesses, tc is the total instruction memory required by the task, td is the amount of memory required to store the local data of the task, and nd is the number of times the local data item is accessed. Each communication edge is described in the form (mid, sid, sd, nsd), where mid is the master task identifier, sid is the slave task identifier, sd is the amount of memory required to store the shared data between the two tasks, and nsd is the number of times the shared data item is accessed. The application task model is profiled to obtain timing characteristics for each task and the access frequency for the various data items.

The objective is to obtain a static mapping and pipelined schedule of the task graph on the target multi-core architecture such that the throughput is maximized while satisfying performance constraints. The result of the mapping procedure is the decision of which tasks run on which processors at what time.

V. ILP FORMULATIONS

In this section, we present the ILP formulation that gives an optimal solution for the problem described in Section 4. Our solution is based on the mapping strategy given in [8] and extends the ILP model with pipelined scheduling. The ILP model supports task merging and task replication. This means that a task may be performed on several processors in order to exploit more parallelism. We assume that each processor has its own local memory and only one task can be executed at a time by one processor. Each DR processor can execute all tasks. Tasks on different processors can be executed in parallel. The ILP formulation incorporates task merging and replication by first assigning processes into batches, which are then assigned or replicated to processors [8].

A short summary of the notation is given below. n represents the number of tasks; m is the number of available DR processors.

T: set of tasks, $T = \{T_1, T_2, \ldots, T_n\}$
L: set of the end (leaf) tasks, $L \subseteq T$
S: set of the start tasks, $S \subseteq T$
P: set of processors, $P = \{P_1, P_2, \ldots, P_m\}$
B: set of batches, $B = \{B_1, B_2, \ldots, B_l\}$, $l = \min(m, n)$

Constants and variables used in the ILP formulation:

$s_{ij,k}$: start time of task $T_i$ on processor $P_j$ in the $k$-th iteration
$C_{ii',k}$: communication time for sending data from $T_i$ to $T_{i'}$ in the $k$-th iteration, if there is an edge $(i, i') \in E$
$t_{il}$: 1 if task $T_i$ is assigned to batch $B_l$, 0 otherwise
$b_{lj}$: 1 if batch $B_l$ is assigned to processor $P_j$, 0 otherwise
$r_{lj}$: 1 if batch $B_l$ is replicated on $j$ processors, 0 otherwise
$d_{ii'j}$: 1 if tasks $T_i$ and $T_{i'}$ are allocated on the same processor $P_j$ and $T_i$ starts execution before task $T_{i'}$, 0 otherwise
$x_{ij}$: 1 if processor $P_j$, which executes task $T_i$, provides data for any task $T_{i'}$ with $(i, i') \in E$ and the two tasks are allocated on different processors, 0 otherwise

An objective function depending on the system throughput (TP) and the latency (LT) needs to be minimised. TP and LT are continuous variables in our ILP model. The weights $k_1$ and $k_2$ of the costs TP and LT can be tuned by the designer. The objective function used in the ILP formulation is:

$$\text{minimise} \; (k_1 \cdot TP + k_2 \cdot LT)$$

Constraints used in the ILP formulation:

Every task must be assigned to a single batch.

$$\forall i \in T: \sum_{l \in B} t_{il} = 1 \quad (1)$$

If a batch is replicated on n processors, then exactly n processors must execute that batch.

$$\forall l \in B: \sum_{n} n \cdot r_{ln} = \sum_{j \in P} b_{lj} \quad (2)$$

Each processor must be assigned to a single batch.
$$\forall j \in P: \sum_{l \in B} b_{lj} = 1 \quad (3)$$

A batch may be assigned to one or more processors only if there is at least one task assigned to the batch. Otherwise, the batch can be ignored.

$$\forall l \in B: \sum_{j \in P} b_{lj} \le \mathrm{MAX\_VAL} \cdot \sum_{i \in T} t_{il} \quad (4)$$

where MAX_VAL is a very large value. The instruction size of all the tasks assigned to a batch cannot exceed the size of the available instruction memory of the processor.

$$\forall l \in B, j \in P: \sum_{i \in T} t_{il} \cdot tc(i) \le pcm(j) \quad (5)$$

The throughput is equal to the maximum effective time over all batches.

$$\forall l \in B: TP \ge \sum_{n} \frac{r_{ln}}{n} \sum_{i \in T} t_{il} \cdot pt(i) \quad (6)$$

The finishing time of each leaf task is less than or equal to OET.

$$\forall i \in L, j \in P: s_{ij,0} + pt(i) \le OET + (1 - t_{il} b_{lj}) \cdot \mathrm{MAX\_VAL} \quad (7)$$

A data dependency constraint exists between the two tasks $T_i$ and $T_{i'}$ if there is an edge between them in the task graph G(V, E). The execution of task $T_i$ has to be finished before the execution of $T_{i'}$ if they are on the same processor [constraint (8.1)]. When the tasks are allocated to different processors, $T_{i'}$ can start $C_{ii',k}$ time units after $T_i$ has finished [constraint (8.2)]. Since the ILP model supports task replication, a task can be allocated to several processors. We need to consider all possible task allocations of replicated tasks on all processors and also need to know which processor that executes task $T_i$ provides the data for task $T_{i'}$. This is done by the variable $x_{ij}$ defined at the beginning of this section.

$$\forall (i, i') \in E, j \in P: s_{ij,k} + pt(i) \le s_{i'j,k} + (2 - t_{ij} - t_{i'j}) \cdot \mathrm{MAX\_VAL} \quad (8.1)$$

$$\forall (i, i') \in E, \; j, j' \in P, \; j \ne j': s_{ij,k} + pt(i) + C_{ii',k} \le s_{i'j',k} + (2 - t_{i'j'} + t_{ij'} - x_{ij}) \cdot \mathrm{MAX\_VAL} \quad (8.2)$$

Two independent tasks must not be executed on the same processor at the same time, i.e., task $T_i$ is executed either before task $T_{i'}$ ($d_{ii'j} = 1$) (9.1) or after task $T_{i'}$ ($d_{ii'j} = 0$) (9.2) on processor $P_j$.

$$\forall (i, i') \notin E, j \in P: s_{ij,k} + pt(i) \le s_{i'j,k} + (3 - t_{ij} - t_{i'j} - d_{ii'j}) \cdot \mathrm{MAX\_VAL} \quad (9.1)$$

$$s_{i'j,k} + pt(i') \le s_{ij,k} + (2 - t_{ij} - t_{i'j} + d_{ii'j}) \cdot \mathrm{MAX\_VAL} \quad (9.2)$$

The start times of all tasks have to be non-negative.
$$\forall i \in T, j \in P: s_{ij,k} \ge 0 \quad (10)$$

If processor $P_j$ executes task $T_i$ ($x_{ij} = 1$), then task $T_i$ is assigned to one batch and this batch is allocated to processor $P_j$.

$$\forall i \in T, j \in P: x_{ij} \le \sum_{l \in B} t_{il} \cdot b_{lj} \quad (11)$$

To minimise the OCT it is necessary to begin the start task in iteration (k+1) as soon as possible. Each task without replication should be allocated to the same processor in each iteration. In addition, it is required that each start task of the (k+1)-th iteration can only start on a processor after this task's k-th iteration is finished on this processor. If a task has been replicated, there is no constraint between the different iterations. The latency time is affected by the first start task of the k-th iteration and the first task of the (k+1)-th iteration.

$$\forall i \in S, j \in P: s_{ij,k} + LT \le s_{ij,k+1} + (1 - t_{il} b_{lj}) \cdot \mathrm{MAX\_VAL} \quad (12)$$

VI. CASE STUDY: LOOP LEVEL PARALLELISM

The following section demonstrates the effectiveness of the proposed mapping methodology using a series of DSP applications. The application set includes (1) a 64-tap Finite Impulse Response (FIR) filter, (2) an Advanced Encryption Standard (AES) application, (3) a Fast Fourier Transform (FFT) application, (4) a smoothing and edging image processing application for a 256*256 grayscale image, and (5) a Freeman demosaicing application for a 1138*850 RGB image.

Some compiler optimization techniques have been adopted in our multi-core mapping models, including loop splitting and loop fusion. Loop splitting attempts to simplify a loop or eliminate dependences by breaking it into multiple loops which have the same bodies but iterate over different contiguous portions of the index range. Loop fusion (loop combining) attempts to reduce loop overhead. When two adjacent loops iterate the same number of times, their bodies can be combined as long as they make no reference to each other's data.

Let us consider the application of the 64-point 6-stage radix-2 FFT to demonstrate loop splitting. The mapping solution depends not only on the strategy but also on the architecture design. The multi-core FFT implementation is mainly affected by the number of processors and the shared register file size in the multi-core architecture. To demonstrate the proposed mapping methodology, the FFT application is mapped to several different multi-core architectures including: (a) limited processor cores with limited shared register banks, (b) limited processor cores with sufficient shared register banks, and (c) sufficient processor cores with sufficient shared register banks.
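Loop splitting, as described above, gives each loop copy a contiguous portion of the original index range. The sketch below shows one way to compute those portions for a given number of processor cores. The names `loop_range` and `split_range` are hypothetical helpers introduced here, and handing leftover iterations to the first chunks is an assumed policy, not something the mapping tool is stated to do.

```c
#include <assert.h>

/* Half-open iteration range [begin, end). */
typedef struct {
    int begin;
    int end;
} loop_range;

/* Return the contiguous portion of [0, total) that chunk `part`
 * (0-based) receives when the range is split into `parts` chunks.
 * If total is not divisible by parts, the first (total % parts)
 * chunks each get one extra iteration. */
loop_range split_range(int total, int parts, int part)
{
    assert(total >= 0 && parts > 0 && part >= 0 && part < parts);
    int base = total / parts;   /* iterations every chunk gets */
    int extra = total % parts;  /* leftover iterations */
    loop_range r;
    r.begin = part * base + (part < extra ? part : extra);
    r.end = r.begin + base + (part < extra ? 1 : 0);
    return r;
}
```

For the 6-stage FFT on two cores, `split_range(6, 2, 0)` covers stages 0 to 2 and `split_range(6, 2, 1)` covers stages 3 to 5, which corresponds to the `stages/2` bound used when the stage loop is divided between two processors.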
TABLE I. A CODE EXAMPLE OF LOOP SPLITTING

Original loop:

    for (stage=0; stage<stages; stage++) {
        shuffle(in, out, SIZE);
        for (i=0; i<SIZE; i=i+2) {
            getw(&w, (i/2), stage);
            fft(w, out[i], out[i+1], &(in[i]), &(in[i+1]));
        }
    }

After splitting:

    /* assigned to Processor 0 */
    for (stage=0; stage<stages/2; stage++) {
        shuffle(t1, t2, SIZE);
        for (i=0; i<SIZE; i=i+2) {
            getw(&w, (i/2), stage);
            fft(w, t2[i], t2[i+1], &(t1[i]), &(t1[i+1]));
        }
    }

    /* assigned to Processor 1 */
    for (stage=stages/2; stage<stages; stage++) {
        shuffle(t1, t3, SIZE);
        for (i=0; i<SIZE; i=i+2) {
            getw(&w, (i/2), stage);
            fft(w, t3[i], t3[i+1], &(t1[i]), &(t1[i+1]));
        }
    }

TABLE II. A CODE EXAMPLE OF LOOP FUSION

Initial Code:

    /* Smooth Processor 0 */
    for (y=0; y<height; y++)
        sharedimage[(y*width)+x] = filter(x, y, smooth, image);

    /* Laplacian Processor 1 */
    for (y=0; y<height; y++)
        result[(y*width)+x] = filter(x, y, laplacian, sharedimage);

Combining Code:

    for (y=0; y<height+3; y++) {
        /* Smooth Processor 0 */
        if (y<height)
            sharedimage[(y*width)+x] = filter(x, y, smooth, image);
        /* Laplacian Processor 1 */
        if (y>=3)
            result[((y-3)*width)+x] = filter(x, (y-3), laplacian, sharedimage);
    }

The FFT-I application is mapped onto a multi-core architecture with two processor cores, sufficient local and shared memories and a 32*32 shared register file, as shown in Table 3. The most time consuming part of the application is the 2-level loop body shown in Table 1. The loop is split into two parts: the first part executes the first 3 stages and the second part executes the last 3 stages (Table 1). Since the shared register file is limited and not sufficient to save all the shared data, the FFT makes use of shared memory to send data from one processor to another. In the general case, data cannot simply be written to and read from shared memory in a multi-core architecture. The programmer can use the mutex or semaphore instructions defined in the TLI to synchronise the data transfer between processors. If there are sufficient shared registers, the shared data items can be mapped to the shared registers instead of the shared memory.
This detailed implementation, called FFT-II, is given in Table 3. These shared registers are 8 banks of 32*32-bit data registers, whose access time (2ns) is much smaller than that of shared memory (5ns). Therefore, frequently read and written shared data should be mapped to the shared register file in order to reduce memory access time and memory access conflicts. A more efficient mapping can be implemented for a multi-core architecture which has three available processor cores (FFT-III) and sufficient shared registers. The stages are evenly divided among the three processors, and each processor executes 2 stages of the FFT.

A similar technique is adopted for the 64-tap FIR filter. The FIR-I filter is split up and implemented on a two-processor architecture: the first processor executes the first 32 taps and the second processor executes the last 32 taps. The FIR-II application has been implemented on 4 processors with each processor executing 16 taps. A fully parallel implementation of the 64-tap FIR filter requires 64 processors, which is determined by the number of taps.

An image processing application (IMP) includes two stages: image smoothing and edge enhancement. Image smoothing attempts to capture important patterns in the image data while leaving out noise. Edge enhancement is a digital image processing filter that improves the apparent sharpness of an image. The initial implementation is given in Table 2. Edge detection waits for the entire image smoothing to finish before it begins. The implementation performance is limited by synchronization; however, there is no need to wait for the entire image. The two stages can synchronize at every line of pixels. Loop combining is adopted in the IMP application. The combined code is given in Table 2, which results in nearly twice the performance of the original code, and nearly 90% processor efficiency compared to the original 50% efficiency.

TABLE III. DATA MAPPING

Apps. | Memory Capacity: Shared (KB), Local (KB), S. Reg. (b) | Memory Access Count: Shared (KB), Local (KB), S. Reg. (b)
FIR(I) | * ,
FIR(II) | *32*32 2,062 5,041 1,086
AES | *
FFT(I) | * ,
FFT(II) | *32* ,
FFT(III) | *32* ,
IMP | *32*32 212, ,
Freeman | *32*32 1,5451, ,414

TABLE IV. IMPLEMENTATION RESULT

Apps. | No. of Proc. | Execution Time (ms) | Speedup | Parallel efficiency | Average Idle Ratio
FIR(I) | %
FIR(II) | %
AES | %
FFT(I) | %
FFT(II) | %
FFT(III) | %
IMP | %
Freeman | %

A series of DSP applications targeted on several multi-core architectures are described in Tables 3 and 4. To make fair performance comparisons, all applications are executed with 100 frames. The experimental results are based on the following assumptions: the DR processors operate at 500MHz, the shared memory access delay is 5ns, the local private memory access delay is 4ns, and the shared multi-bank register file access delay is 2ns. The memory sizes together with the number of memory access operations for each application are given in Table 3, where each processor has an equal local memory size. A memory access occurs when information is read from or written to a memory unit. Table 3 also provides the average number of memory access operations per frame.

Table 4 shows the total execution time, the speedup, parallel efficiency, and average idle ratio of the different applications. Speedup refers to how much faster a parallel algorithm is than a corresponding sequential algorithm. The parallel efficiency (PE) metric indicates how efficiently the processors are utilized in solving the problem, and is obtained by dividing the speedup achieved by the number of processor cores used. The idle ratio (IR) metric refers to the ratio between the periods when a DR processor core is idle and the overall simulation time. The IR given in Table 4 is the average idle ratio over the processors. As the results show, the applications with sufficient processors and memory elements achieve both the highest speedup and the highest parallel efficiency, compared to other task mapping solutions.
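The three metrics defined above reduce to simple ratios. The following sketch states them as C functions. The function names are hypothetical, and speedup is taken to be sequential execution time divided by parallel execution time, which is the usual reading of the definition given here.

```c
#include <assert.h>

/* Speedup: sequential execution time divided by parallel execution time. */
double speedup(double sequential_time, double parallel_time)
{
    assert(sequential_time > 0.0 && parallel_time > 0.0);
    return sequential_time / parallel_time;
}

/* Parallel efficiency (PE): speedup divided by the number of cores used. */
double parallel_efficiency(double speedup_value, int cores)
{
    assert(cores > 0);
    return speedup_value / (double)cores;
}

/* Idle ratio (IR): time a core is idle divided by the overall simulation time. */
double idle_ratio(double idle_time, double total_time)
{
    assert(total_time > 0.0 && idle_time >= 0.0 && idle_time <= total_time);
    return idle_time / total_time;
}
```

As a check against the figures reported in this paper, a speedup of 2.59 on two cores gives `parallel_efficiency(2.59, 2)` of about 1.295, in line with the stated parallel efficiency of 1.30. A value above 1 is only possible with super linear speedup.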
In all applications with loop splitting and loop fusion, all processor cores are much more efficient with very low idle ratios. The FIR filter with sufficient local memory and shared registers gains a super linear speedup [11], where the speedup is greater than the number of processor cores. The super linear speedup obtained in this paper is attributed to the data locality, which reduces the accesses to the slower shared data memory and dramatically improves the performance. Up to 2.59x super linear speedup and a parallel efficiency of 1.30 have been achieved with only two DR processor cores with a local memory and employing loop splitting. The simulation results show that the proposed mapping methodology with loop level parallelism provides superior performance.

VII. CONCLUSIONS

The focus of this paper is on modeling the task mapping and scheduling problem as an ILP, which allows the use of standard tools for solving it. The proposed mapping technique utilizes profiling-driven task partitioning and loop level transformations. These are intelligently fused with loop splitting, loop fusion and a memory aware data mapping in order to reduce system execution time. Several applications based on different multi-core architectures have been generated using our mapping and scheduling tool. Simulation results demonstrate the effectiveness of the proposed mapping and scheduling strategies, showing up to 1.3 parallel efficiency for a multi-core architecture with two DR processor cores.

REFERENCES

[1] S. Khawam, I. Nousias, M. Milward, Y. Yi, M. Muir, and T. Arslan, "The Reconfigurable Instruction Cell Array," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 16, pp. ,
[2] Wei Han, Ying Yi, M. Muir, N. Ioannis, T. Arslan, and A. T. Erdogan, "Efficient Implementation of WiMAX Physical Layer on Multi-core Architecture with Dynamically Reconfigurable Processors," Scalable Computing: Practice and Experience, Scientific international journal for parallel and distributed computing, Vol. 9, ISSN ,
[3] T. Kempf, M. Doerper, R. Leupers, G. Ascheid, H. Meyr, T. Kogel, and B.
Vanthournout, "A modular simulation framework for spatial and temporal task mapping onto multi-processor SoC platforms," in Proceedings of the Conference on Design, Automation and Test in Europe (DATE), pp. , 2005.
[4] P. G. Paulin, "Automatic mapping of parallel applications onto multiprocessor platforms: a multimedia application," in Digital System Design, Euromicro Symposium, pp. 2-4,
[5] N. Pazos, A. Maxiaguine, P. Ienne, and Y. Leblebici, "Parallel modeling paradigm in multimedia applications: Mapping and scheduling onto a multi-processor system-on-chip platform," in Proceedings of the International Global Signal Processing Conference, Santa Clara, California,
[6] M. Ruggiero, A. Guerri, D. Bertozzi, F. Poletti, M. Milano, "Communication-aware allocation and scheduling framework for stream-oriented multi-processor system-on-chip," in Proceedings of the Conference on Design, Automation and Test in Europe (DATE), pp. 3-8,
[7] C. Marcon, A. Borin, A. Susin, L. Carro, F. Wagner, "Time and Energy Efficient Mapping of Embedded Applications onto NoCs," Asia and South Pacific Design Automation Conference (ASP-DAC), pp. , Vol. 1,
[8] C. Ostler and K.S. Chatha, "An ILP Formulation for System-Level Application Mapping on Network Processor Architectures," Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1-6,
[9] Wei Han, Ying Yi, M. Muir, N. Ioannis, T. Arslan, and A. T. Erdogan, "MRPSIM: a TLM based Simulation Tool for MPSoCs targeting Dynamically Reconfigurable Processors," 21st Annual IEEE International SOC Conference, pp. , September,
[10] A. Bender, "Design of an Optimal Loosely Coupled Heterogeneous Multiprocessor System," Proceedings of European Design and Test Conference (ED&TC 96), pp. ,
[11] David Culler, J.P. Singh and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 2nd edition, 1999.