Multimedia Multiprocessor Systems: Analysis, Design and Management. Akash Kumar


1 Multimedia Multiprocessor Systems: Analysis, Design and Management Akash Kumar

2 2 Modern Multimedia Embedded Systems

3 3 Trends in Multimedia Systems: an increasing number of features (i.e. applications); simultaneously active applications; power is becoming increasingly important; short time-to-market, with new devices released every few months; multiple standards to be supported; multiprocessors are being used increasingly.

4 4 Challenges in Multimedia System Design: ensure that all applications can meet their performance requirements; handle the huge number of use-cases, i.e. combinations of applications (each possible set of applications leads to a new use-case, so for 10 applications there are over a thousand use-cases); limit the design time (a late product launch directly hurts profits, and increased design time implies higher design costs); deal with dynamism in the applications.
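The use-case count on this slide can be checked by enumeration. The sketch below assumes that every non-empty subset of the applications counts as one use-case (the slide does not say whether the empty set is included); `count_use_cases` is a hypothetical helper name, not something from the talk.

```python
from itertools import combinations


def count_use_cases(num_apps: int) -> int:
    """Count the non-empty subsets of `num_apps` applications by enumeration."""
    apps = range(num_apps)
    return sum(
        1
        for size in range(1, num_apps + 1)
        for _ in combinations(apps, size)
    )


print(count_use_cases(10))  # 1023, i.e. 2**10 - 1
```

Every additional application doubles the count, which is why per-use-case simulation stops scaling quickly.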

5 5 Contributions. Analysis: accurately predict the performance of multiple applications executing concurrently, using basic and iterative probabilistic techniques. Design: synthesizing MPSoC for multiple applications and for multiple use-cases. Management: resource manager for MPSoC systems, with admission control and budget enforcement.

6 6 Assumptions: heterogeneous MPSoCs are used increasingly, because applications have different levels of parallelism (a microprocessor is better for control flow, a DSP is better for signal processing) and dedicated hardware blocks are needed for certain parts (this improves efficiency and saves power); applications are modeled as SDF; first-come-first-serve arbiter at the cores; non-preemptive system, i.e. tasks cannot be stopped once started.

7 7 Non-Preemptive Systems: the task state-space needed is smaller; lower implementation cost; less overhead at run-time (cache pollution, memory size).

8 8 Design Flow: [Figure: design flow. Inputs are the application specifications (applications A, B and C with actors a0-a3, b0-b2 and c0-c2), the use-cases (1, 2 and 3) and the hardware specification. Stages shown: Performance Analysis (Chapter 3), which yields throughput analysis results; Admission Control (Chapter 4); Budget Enforcement (Chapter 4); System Design and Synthesis (Chapters 5 & 6). The generated platform consists of processors with arbiters and resource managers (RM) onto which the actors are mapped.]

9 9 Outline. Introduction: multimedia multiprocessor systems; introduction to SDF. Analysis: basic probabilistic performance prediction; iterative probabilistic performance prediction. Design: synthesizing MPSoC for multiple applications; synthesizing MPSoC for multiple use-cases. Management: resource management for MPSoC systems.

10 10 Synchronous Dataflow Graphs: first proposed in 1987 by Edward Lee; Synchronous Data Flow Graphs (SDFGs) are used extensively for DSP applications and multimedia applications; similar to task graphs with dependencies.

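11 11 Synchronous Dataflow Graphs: [Figure: an SDF graph annotated with the terms actor, rate, token, channel and execution time. Actor A is connected to actor B through channel α with rates 2 and 3, and B is connected to actor C through channel β with rates 1 and 2. The graph is shown before and after firing A.]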

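12 12 Synchronous Dataflow Graphs: [Figure: the same graph (A to B through channel α with rates 2 and 3, B to C through channel β with rates 1 and 2), shown before and after firing B.]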
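The firing rule illustrated on these two slides can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk. It assumes the figure is read as: A produces 2 tokens per firing on channel α, B consumes 3 from α and produces 1 on β, and C consumes 2 from β, with both channels initially empty. The names `tokens`, `CONSUMES`, `PRODUCES`, `can_fire` and `fire` are hypothetical.

```python
# Tokens currently stored on each channel (assumed to start empty).
tokens = {"alpha": 0, "beta": 0}

# Assumed rates: tokens consumed from / produced on each channel per firing.
CONSUMES = {"A": {}, "B": {"alpha": 3}, "C": {"beta": 2}}
PRODUCES = {"A": {"alpha": 2}, "B": {"beta": 1}, "C": {}}


def can_fire(actor: str) -> bool:
    """An actor is ready when every input channel holds enough tokens."""
    return all(tokens[ch] >= rate for ch, rate in CONSUMES[actor].items())


def fire(actor: str) -> None:
    """Consume the input tokens, then produce the output tokens."""
    if not can_fire(actor):
        raise RuntimeError(f"actor {actor} is not ready to fire")
    for ch, rate in CONSUMES[actor].items():
        tokens[ch] -= rate
    for ch, rate in PRODUCES[actor].items():
        tokens[ch] += rate


fire("A")
fire("A")  # alpha now holds 4 tokens, so B is ready
fire("B")  # B takes 3 tokens from alpha and puts 1 on beta
print(tokens, can_fire("C"))
```

With these rates, one full iteration of the graph needs three firings of A, two of B and one of C, since 3 x 2 = 2 x 3 tokens cross α and 2 x 1 = 1 x 2 tokens cross β.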

13 13 Synchronous Dataflow Graphs: example. [Figure: SDF graph of an H263 decoder with the actors VLD, IQ, IDCT and Reconstruction, annotated with rates and execution times.]

14 14 Synchronous Dataflow Graphs. Advantages: easily allows performance analysis of single applications; communication buffers can be easily modeled. Disadvantages: sharing of resources is hard to model; only static resource arbitration can be modeled, which leaves infinite possibilities with multiple applications; it is difficult to analyze the performance of multiple applications executing concurrently; unable to handle dynamism in the application.

15 15 Problem: Predicting Multiple Application Performance. Two applications, A and B, each with three actors; mapped on a heterogeneous platform with processors P1, P2 and P3; non-preemptive scheduler. [Figure: mapping and scheduling of A and B onto P1, P2 and P3.]

16 16 Considering Only Actors on a Processor: [Table: iteration count of A, B and their total over 3,000 cycles, with columns Only Actors, Individual Graph, Worst Case, Static, and Priority Based (A preferred, B preferred).]

17 17 Considering Only Applications: [Table: iteration count of A, B and their total over 3,000 cycles, with columns Only Actors, Individual Graph, Worst Case, Static, and Priority Based (A preferred, B preferred).]

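18 18 Worst Case Waiting Time: calculate the waiting time of each actor. [Figure: applications A and B on processors P1, P2 and P3, with the wait of an actor of A highlighted.]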

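19 19 Worst Case Waiting Time: [Figure: applications A and B on processors P1, P2 and P3, with the worst-case waiting time added to application A.]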

20 20 Worst Case Waiting Time: unrealistic, only a lower bound. [Table: iteration count of A, B and their total over 3,000 cycles, with columns Only Actors, Individual Graph, Worst Case, Static, and Priority Based (A preferred, B preferred).]

21 21 Static Order Arbitration: add ordering dependencies (edges) between the actors of A and B. [Figure: schedule of A and B on P1, P2 and P3 over the time points t0 to t3, reaching a steady state.]

22 22 Problem: Predicting Performance. [Table: iteration count of A, B and their total over 3,000 cycles, with columns Only Actors, Individual Graph, Worst Case, Static, and Priority Based (A preferred, B preferred).]

23 23 Problem: Predicting Performance, priority based. [Figure: schedule of A and B on P1, P2 and P3 over the time points t0 to t3, reaching a steady state.]

24 24 Problem: Predicting Performance. [Table: iteration count of A, B and their total over 3,000 cycles, with columns Only Actors, Individual Graph, Worst Case, Static, and Priority Based (A preferred, B preferred).]

25 25 Problem: no good techniques exist to analyze the performance of multiple applications on non-preemptive heterogeneous systems. Use a probabilistic approach to estimate the performance of multiple applications running on an MPSoC platform.

26 26 Analyzing Multiple Applications Performance: when resources need to be shared, the actor execution may be delayed. Determining this waiting time is the key: $t_{resp} = t_{exec} + t_{wait}$, where $t_{wait}$ is the unknown.

27 27 Probability Distribution: compute the probability distribution of a resource being blocked by an actor of A. [Figure: P(x) plotted against x, with the probability marks 2/3 and 1/3.] $E(x) = \int x \, P(x) \, dx$. Here x denotes the time other actors have to wait for the respective resources to be free from actors of A, and E(x) provides the expected time an actor will need to wait when sharing resources with actors of A.

28 28 Updated Response Time: [Figure: applications A and B with the response time of each actor updated by the expected waiting time.]

29 29 Basic P³ Algorithm: (1) compute the throughput of all applications; (2) compute the probability of blocking a resource; (3) estimate the waiting time for all actors; (4) update the response time for all actors (response time = execution time + waiting time); (5) re-compute the application throughput.

30 30 Basic P³ Algorithm: exponential complexity. So if actors $a_i$ and $b_i$ are mapped on the same resource, $b_i$ on average will need to wait for

31 31 Complexity Reduction: the overall complexity is $O(n^n)$, where n is the number of actors mapped on a processing resource, because of the higher-order probability products. Limiting the equation to only second- or fourth-order terms reduces the complexity significantly. Algorithm complexity: original $O(n^n)$; second-order $O(n^2)$; fourth-order $O(n^4)$.

32 32 Probabilistic Performance Prediction (P³). Basic P³ technique: looks at all possible combinations of other actors blocking a particular actor, which results in exponentially many possibilities. Iterative P³ technique: looks at how an actor can contribute to the waiting time of other actors, which results in linear complexity; iterating over the algorithm while updating the throughput improves the estimate further.

33 33 Determining the Waiting Time: an actor can be in three states. Not ready (data not present): actors arriving in this state are not affected by this actor. Ready and waiting (data present, but the resource is busy): actors arriving in this state have to wait for the full execution of this actor. Ready and executing (data and resource available): the waiting time for other actors depends on where the actor is in its execution; a uniform distribution is assumed.

34 34 A's Waiting Time Due to B: [Figure: actors A, B, C and D queue at the arbiter of a processor. Three cases are shown: B not in the queue, B waiting in the queue, and B being served.]

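35 35 Updated Probability Distribution: [Figure: P(x) against x. When the actor is not ready, x = 0 with probability $1 - P_w - P_e$; when the actor is in the queue, $x = t_{exec}$ with probability $P_w$; when the actor is executing, x is spread between 0 and $t_{exec}$ with total probability $P_e$.] $E(x) = P_w \cdot t_{exec} + P_e \cdot \frac{t_{exec}}{2}$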

36 36 Updated Probability Distribution, conservative: [Figure: P(x) against x. When the actor is not ready, x = 0 with probability $1 - P_w - P_e$; when the actor is in the queue or executing, $x = t_{exec}$ with probabilities $P_w$ and $P_e$.] $E(x) = P_w \cdot t_{exec} + P_e \cdot t_{exec} = (P_w + P_e) \cdot t_{exec}$
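The two expectations on these slides differ only in how long an arriving actor is assumed to wait when the blocking actor is already executing: half of the execution time on average (uniform distribution), or the full execution time in the conservative variant. A minimal Python sketch of both, where the function name `expected_wait` and its parameter names are assumptions made for illustration:

```python
def expected_wait(p_wait: float, p_exec: float, t_exec: float,
                  conservative: bool = False) -> float:
    """Expected time another actor waits because of one blocking actor.

    p_wait: probability that the blocking actor is ready and queued (P_w)
    p_exec: probability that the blocking actor is executing (P_e)
    t_exec: execution time of the blocking actor
    """
    if p_wait < 0 or p_exec < 0 or p_wait + p_exec > 1:
        raise ValueError("P_w and P_e must be non-negative and sum to at most 1")
    # While queued, the blocking actor still needs its full execution time.
    # While executing, the remaining time is t_exec / 2 on average (uniform),
    # or t_exec if we choose to be conservative.
    remaining_when_executing = t_exec if conservative else t_exec / 2.0
    return p_wait * t_exec + p_exec * remaining_when_executing


print(expected_wait(0.2, 0.3, 100.0))
print(expected_wait(0.2, 0.3, 100.0, conservative=True))
```

The conservative value is never smaller than the regular one, so it trades accuracy for a safer (more pessimistic) estimate.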

37 37 Iterative Probability: iterate until the analysis estimate stabilizes, updating the throughput in each iteration. Steps: (1) compute the throughput of all applications; (2) compute the probability of blocking a resource, both while waiting and while executing; (3) estimate the waiting time for all actors; (4) update the response time for all actors (response time = execution time + waiting time); (5) re-compute the application throughput.
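The loop on this slide can be sketched as follows. This is a simplified illustration under several assumptions that are not from the talk: each application is a simple cycle of actors, so its period is the sum of its actors' response times; $P_e$ and $P_w$ of an actor are taken as its execution and waiting time divided by the period of its application; and the waiting time of an actor is the sum of the expected waits contributed by actors of other applications on the same processor. The names `iterate_p3`, `exec_times` and `mapping` are hypothetical.

```python
def iterate_p3(exec_times, mapping, tol=1e-6, max_iter=100):
    """Iteratively estimate application periods under resource contention.

    exec_times: {app: [execution time of each actor]}
    mapping:    {app: [processor of each actor]}
    Returns {app: estimated period}.
    """
    wait = {app: [0.0] * len(times) for app, times in exec_times.items()}
    period = {}
    for _ in range(max_iter):
        # Steps 1 and 5: (re-)compute the period of every application.
        # Assumption: period = sum of response times (simple cyclic graph).
        new_period = {
            app: sum(t + w for t, w in zip(exec_times[app], wait[app]))
            for app in exec_times
        }
        converged = bool(period) and all(
            abs(new_period[app] - period[app]) < tol for app in new_period
        )
        period = new_period
        if converged:
            break
        # Steps 2 to 4: blocking probabilities, waiting and response times.
        new_wait = {}
        for app in exec_times:
            new_wait[app] = []
            for proc in mapping[app]:
                total = 0.0
                for other in exec_times:
                    if other == app:
                        continue
                    for t, w, p in zip(exec_times[other], wait[other],
                                       mapping[other]):
                        if p == proc:
                            p_exec = t / period[other]  # executing
                            p_wait = w / period[other]  # ready and queued
                            total += p_wait * t + p_exec * t / 2.0
                new_wait[app].append(total)
        wait = new_wait
    return period


periods = iterate_p3(
    exec_times={"A": [100, 100, 100], "B": [100, 100, 100]},
    mapping={"A": ["P1", "P2", "P3"], "B": ["P1", "P2", "P3"]},
)
print(periods)
```

Each pass visits every actor once per co-mapped actor rather than every combination of blockers, which is where the lower complexity of the iterative technique comes from.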

38 38 Experimental Results: the SDF³ tool was used to generate random graphs; ten graphs were generated, each with 8-10 actors; over 1000 use-cases were generated; simulations were performed using POOSL (Parallel Object Oriented Specification Language); simulation took 28 hours, against 10 min for analysis using all approaches.

39 39 Iterative Analysis, all applications together: [Figure: application period (normalized to original) for applications A to J. Series: Original, Simulation, Worst case, WCSim, Basic, Iterative.]

40 40 Iterative Analysis, all applications together: [Figure: application period (normalized to simulated) for applications A to J. Series: Simulation, Basic, Iterative, Conservative.]

41 41 Case-study with Mobile Phone Applications: [Figure: period of applications (normalized to original period) for H263 Decoder, H263 Encoder, JPEG Decoder, Modem and Voice Call. Series: Simulation, Iterative Analysis, Conservative Analysis, Worst Case, Basic - Fourth Order.]

42 42 FPGA Implementation Results: [Table: clock cycles, time in ms at 100 MHz, average and maximum error (%) and complexity per algorithm/stage.] Complexity per algorithm/stage: Load from CF Card $O(N \cdot n \cdot k)$; Throughput Computation $O(N \cdot n \cdot k)$; Worst Case $O(m \cdot M)$; Second Order $O(m^2 \cdot M)$; Fourth Order $O(m^4 \cdot M)$; Iterative - 1 Iteration $O(m \cdot M)$; Iterative - 1 Iteration* $O(m \cdot M + N \cdot n \cdot k)$; Iterative - 5 Iterations* $O(m \cdot M + N \cdot n \cdot k)$; Iterative - 10 Iterations* $O(m \cdot M + N \cdot n \cdot k)$. Legend: N is the number of applications, n the number of actors in an application, k the number of throughput equations for an application, m the number of actors mapped on a processor, and M the number of processors. Highlighted: 2.8 ms at 100 MHz. Copyright 2010 Akash Kumar.

44 44 Problem: current design practice for multiple applications is manual or semi-automated, which is error-prone and time-consuming.

45 45 Current Tools - Example: Xilinx. The automatic tool chain is limited to single processors; no support for multiple applications; design space exploration is manual.

46 46 Solution: Multi-Application Multi-Processor Synthesis (MAMPS). A design flow that takes in application specification(s), generates the entire MPSoC hardware, and creates the software models for it (a real C program can also be run). It provides two main benefits: fast design space exploration and support for multiple applications.

47 47 MAMPS Overview

48 48 MAMPS Software Arbitration Static Scheduling Dynamic Scheduling

49 49 MAMPS Example: H263 decoder. [Figure: SDF graph of the H263 decoder with the actors VLD, IQ, IDCT and Reconstruction, annotated with rates and execution times.]

50 MAMPS Example: H263 decoder. [Figure: generated platform with four processors (Pro 0: VLD, Pro 1: IQ, Pro 2: IDCT, Pro 3: Recon) connected by FIFO links, and a bus with Timer, UART, CF Card and DDR RAM.]

51 51 Standalone Automated DSE Data Collection

52 52 DSE Case Study Buffer-throughput trade-off JPEG and H263 decoders

53 53 DSE Case Study: design time, given as manual design / generating a single design / complete DSE. Hardware generation: ~2 days / 40 ms / 40 ms. Software generation: ~3 days / 60 ms / 60 ms. Hardware synthesis: 35:40 min / 35:40 min / 35:40 min. Software synthesis: 0:25 min / 0:25 min / 10:00 min. Total time: ~5 days / 36:05 min / 45:40 min. Average time per iteration: ~5 days / 36:05 min / 1:54 min. Speed-up: - / 1x / 19x.
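The totals and the speed-up on this slide can be re-derived from the listed times. The short Python check below uses only the mm:ss figures from the slide and ignores the hardware and software generation steps, which together take about 100 ms; `to_seconds` is a hypothetical helper.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string such as '35:40' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Total = hardware synthesis + software synthesis.
single_total = to_seconds("35:40") + to_seconds("0:25")
dse_total = to_seconds("35:40") + to_seconds("10:00")

# Speed-up = average time per iteration (single design) / (complete DSE).
speedup = to_seconds("36:05") / to_seconds("1:54")

print(single_total, dse_total, round(speedup))
```

The hardware is synthesized once and only the software is re-synthesized for each design point, which is why the average time per iteration drops so sharply in the complete DSE column.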

54 54 MAMPS: used by the following people. Ahsan Shabbir, TUe. Michiel Rooijakkers, TUe. Thom Gielen, TUe and NUS, Singapore. Abhinav Krishna, NUS, Singapore. Priyantha Desilva, NUS, Singapore. Shakith Fernando, NUS, Singapore. Zhonglei, TU München, Germany. James Young, Brigham Young University. Amit Kumar Singh, Nanyang Technological University, Singapore. Guan Yu, IMEC, Belgium.

55 55 Handling Multiple Use-cases: for rapid prototyping, hardware synthesis time is the bottleneck, which limits the design space exploration. For a real system, more use-cases imply more memory to store the configurations and increased switching. Use-case merging and partitioning reduces the number of partitions and the synthesis time, which is better for DSE and for run-time memory.

56 56 Use-case Merging: [Figure: the hardware designs of use-case A and use-case B, each a set of connected processors, are combined into one merged design containing Proc 0, Proc 1, Proc 2 and Proc 3.]
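Merging and first-fit partitioning can be sketched as below. This is an illustrative model only: it assumes a design is a set of processors plus a set of point-to-point links, that merging takes the union of both, and that a partition is feasible while a simple cost (processors plus links, standing in for FPGA area) stays within a capacity. `Design`, `cost`, `first_fit` and the example use-cases are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    processors: frozenset
    links: frozenset  # (source processor, destination processor) pairs

    def merge(self, other: "Design") -> "Design":
        """Merged design: a superset able to run either use-case."""
        return Design(self.processors | other.processors,
                      self.links | other.links)

    def cost(self) -> int:
        """Assumed stand-in for area: one unit per processor and per link."""
        return len(self.processors) + len(self.links)


def first_fit(use_cases, capacity):
    """Put each use-case into the first partition whose merged design fits."""
    partitions = []
    for use_case in use_cases:
        if use_case.cost() > capacity:
            raise ValueError("use-case does not fit on the device by itself")
        for i, merged in enumerate(partitions):
            candidate = merged.merge(use_case)
            if candidate.cost() <= capacity:
                partitions[i] = candidate
                break
        else:
            partitions.append(use_case)
    return partitions


uc_a = Design(frozenset({0, 1, 2}), frozenset({(0, 1), (1, 2)}))
uc_b = Design(frozenset({0, 1, 2, 3}), frozenset({(0, 1), (2, 3)}))

print(len(first_fit([uc_a, uc_b], capacity=7)))  # both fit in one design
print(len(first_fit([uc_a, uc_b], capacity=6)))  # must be split in two
```

Fewer partitions mean fewer hardware designs to synthesize and store, at the price of a larger merged design per partition.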

57 57 Use-case Partitioning

58 58 Use-case Merging and Partitioning Results: [Table: number of partitions and time (ms), without and with reduction, for random graphs and for the mobile phone case. Rows per case: Without Merging, Greedy, First-Fit, Optimal Partitions, Reduction Factor. Some entries are reported as Out of Memory.]

60 60 Dynamism in Applications: multimedia applications are often dynamic. SDF assumes worst-case execution times, which is not realistic: analysis results may be pessimistic and lead to a waste of resources and energy. Dynamic execution times may lead to unpredictable application performance.

61 61 Unpredictability: variation in execution time. [Figure: schedule of A and B on P1, P2 and P3 over the time points t0 to t3, reaching a steady state.]

62 62 Resource Manager: budget enforcement. When running, each application signals the RM when it completes an iteration; the RM keeps track of each application's progress and suspends an application if needed. Operation modes: polling mode and interrupt mode.

63 63 Budget Enforcement (Polling): [Figure: performance over time under the resource manager, with the annotations "New job enters!", "Performance goes down!", "Better than required!", "job suspended!" and "job resumed!".]
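Budget enforcement in polling mode can be sketched as a small loop: applications report completed iterations, and at every poll the resource manager compares achieved progress against the budget and suspends or resumes accordingly. The sketch below is an assumption-laden illustration, not the implementation from the talk: the budget is expressed as iterations per polling interval, the average is taken over all polls so far, and the names `App`, `ResourceManager`, `desired_rate` and `poll` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    desired_rate: float  # budget: iterations per polling interval (assumed unit)
    iterations: int = 0
    suspended: bool = False

    def signal_iteration(self) -> None:
        """Called by the application each time it completes an iteration."""
        self.iterations += 1


class ResourceManager:
    def __init__(self, apps):
        self.apps = list(apps)
        self.polls = 0

    def poll(self) -> None:
        """Check every application's progress and enforce its budget."""
        self.polls += 1
        for app in self.apps:
            achieved = app.iterations / self.polls
            # Running better than required: suspend so others can catch up.
            # At or below the budget: (re-)enable the application.
            app.suspended = achieved > app.desired_rate


fast = App("fast", desired_rate=2.0)
slow = App("slow", desired_rate=2.0)
rm = ResourceManager([fast, slow])

for _ in range(3):
    fast.signal_iteration()
slow.signal_iteration()
rm.poll()
print(fast.suspended, slow.suspended)

rm.poll()  # no new iterations from the suspended application
print(fast.suspended, slow.suspended)
```

A shorter polling interval reacts faster to applications exceeding their budget but costs more run-time overhead in the resource manager.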

64 65 Performance without Resource Manager

65 66 Performance with RM I (2.5m cycles)

66 67 Performance with RM II (0k cycles)

67 68 Conclusions: modern multimedia systems support a number of applications executing concurrently, and a number of challenges remain for designers. Probabilistic performance prediction is presented for multiple applications executing concurrently; the approach is fast, yet accurate, which makes it ideal for DSE. A design methodology is proposed that takes application specification(s) and generates the MPSoC platform. Multiple use-cases are handled by merging and partitioning. A resource manager is presented, providing admission control and budget enforcement.

68 69 Future Work: support for hard real-time applications, in both analysis and design flow; providing soft real-time guarantees in the analysis; mixing hard and soft real-time tasks; extending MAMPS to CSDF and SADF models; achieving predictability in suspension; considering how often each use-case is used when partitioning them.

69 70 Relevant Publications Journals (first author) Akash Kumar et al. Multi-processor Systems Synthesis for Multiple Use-Cases of Multiple Applications on FPGA. Transactions on Design Automation in Electronic Systems (ToDAES), ACM. Akash Kumar et al. Analyzing Composability of Applications on MPSoC Platforms, Journal of Systems Architecture (JSA), Elsevier. Akash Kumar et al. Iterative Probabilistic Performance Prediction for Multi-Application Multi-Processor Systems, Transactions on Computer Aided Design (TCAD), IEEE.

70 71 Relevant Publications Conferences (first author) Akash Kumar et al. Global Analysis of Resource Arbitration for MPSoC. Digital Systems Design (DSD), IEEE. Akash Kumar et al. Resource Manager for Non-preemptive Heterogeneous Multiprocessor System-on-chip. Embedded Systems for Real-Time Multimedia (Estimedia) IEEE. Akash Kumar et al. An FPGA Design Flow for Reconfigurable Network-Based Multi-Processor Systems-on-Chip. Design Automation and Test in Europe (DATE), IEEE. Akash Kumar et al. A Probabilistic Approach to Model Resource Contention for Performance Estimation of Multi-featured Media Devices, Design Automation Conference (DAC), ACM/IEEE. Akash Kumar et al. Multi-processor System-level Synthesis for Multiple Applications on Platform FPGA, Field Programmable Logic (FPL), IEEE.