Throughput constraint for Synchronous Data Flow Graphs


Alessio Bonfietti, Michele Lombardi, Michela Milano, Luca Benini
DEIS, University of Bologna
Work supported by the PREDATOR EU Project

Talk topic
Resource allocation and scheduling of an application modeled with a Synchronous Data-Flow Graph, given a throughput bound.
Context: Embedded Systems, Multimedia Systems, MP-SoC (MultiProcessor System-on-Chip), Multimedia Applications, Stream Computing based on the Data-Flow Model, Throughput, Synchronous Data-Flow

Synchronous Data-Flow Graphs
Node (Actor): Task
Edge: Communication Channel
Token: Data
Concurrency
Execution Constraints (Rate)
Periodic Behaviour (Repetition Vector)
Repetition Vector: [3,2,2,1]
An actor fires when there are enough tokens on all input arcs.
Each time an actor fires, it:
1. consumes tokens for each ingoing channel
2. produces tokens for each outgoing channel
[Figure: example SDF graph with actors A, B, C, D and the rates on its channels]
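
The repetition vector follows from the balance equations: for every channel, the number of firings of the producer times its production rate must equal the number of firings of the consumer times its consumption rate. The Python sketch below is illustrative only. The function name, the tuple layout of `channels`, and the example rates are assumptions (the rates were chosen so that the result matches the vector quoted on the slide, not read from the figure), and the graph is assumed to be connected.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def repetition_vector(actors, channels):
    """Smallest positive integer solution of the SDF balance equations.

    channels: list of (src, dst, prod_rate, cons_rate) tuples (assumed layout).
    Assumes a connected graph; raises ValueError if the rates are inconsistent.
    """
    q = {actors[0]: Fraction(1)}
    changed = True
    while changed:  # propagate firing ratios along channels until stable
        changed = False
        for src, dst, prod, cons in channels:
            if src in q and dst not in q:
                q[dst] = q[src] * prod / cons
                changed = True
            elif dst in q and src not in q:
                q[src] = q[dst] * cons / prod
                changed = True
    for src, dst, prod, cons in channels:
        if q[src] * prod != q[dst] * cons:
            raise ValueError("inconsistent rates: no repetition vector exists")
    # Scale the rational solution to the smallest integer one.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (f.denominator for f in q.values()))
    ints = {a: int(f * scale) for a, f in q.items()}
    common = reduce(gcd, ints.values())
    return [ints[a] // common for a in actors]

# Hypothetical chain A -> B -> C -> D with invented rates.
example = repetition_vector(
    ["A", "B", "C", "D"],
    [("A", "B", 2, 3), ("B", "C", 1, 1), ("C", "D", 1, 2)],
)
print(example)
```

Exact rational arithmetic is used so that the consistency check does not suffer from floating-point rounding.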

Homogeneous Synchronous Data-Flow Graphs
HSDFG: homogeneous rate (all rates are 1)
Single execution of each task
Higher number of nodes
Property: preserves the throughput
The transformation process is based on the Repetition Vector (e.g. [2,1])
Throughput of the SDF graph = Throughput of the HSDF graph
[Figure: an SDFG with actors A and B and the equivalent HSDFG with nodes A1, A2 and B]

Throughput
To compute the throughput of the HSDF graph:
Find all cycles in the graph
For each cycle c, compute the total execution time over the number of tokens of c
Throughput = 1 / the maximum of these values
The throughput is the average number of actor firings over time
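
The cycle-based definition above can be checked by brute force on small graphs. The sketch below enumerates the simple cycles of an HSDFG and returns 1 over the largest ratio of execution time to tokens. It is a minimal illustration, not the method used in the talk: the function names, the dictionary layout (`succ[u][v]` holds the tokens on arc u -> v, at most one arc per node pair) and the example graph are all assumptions.

```python
from fractions import Fraction

def simple_cycles(succ):
    """Yield every simple cycle once, as a node list starting at its smallest node."""
    for start in sorted(succ):
        stack = [(start, [start])]
        while stack:
            node, path = stack.pop()
            for nxt in succ[node]:
                if nxt == start:
                    yield path
                elif nxt > start and nxt not in path:
                    stack.append((nxt, path + [nxt]))

def throughput(exec_time, succ):
    """1 / max over cycles of (total execution time / tokens on the cycle).

    Returns 0 if some cycle carries no token (deadlock) and None if the
    graph has no cycle at all.
    """
    worst = None
    for cycle in simple_cycles(succ):
        time = sum(exec_time[v] for v in cycle)
        arcs = zip(cycle, cycle[1:] + cycle[:1])
        tokens = sum(succ[u][v] for u, v in arcs)
        if tokens == 0:
            return Fraction(0)
        ratio = Fraction(time, tokens)
        worst = ratio if worst is None else max(worst, ratio)
    return None if worst is None else 1 / worst

# Hypothetical graph: cycle A-B with 1 token, cycle B-C with 2 tokens.
exec_time = {"A": 1, "B": 2, "C": 3}
succ = {"A": {"B": 0}, "B": {"A": 1, "C": 0}, "C": {"B": 2}}
result = throughput(exec_time, succ)
print(result)
```

Enumerating all cycles is exponential in the worst case, which is why polynomial maximum cycle ratio algorithms are preferred in practice.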

Input & Output
Input: SDF Graph (# Tasks, # Channels, # Tokens); Homogeneous Architecture (# Processors)
Output: Allocation (bind each task to a specific processor); Schedule (order the execution of tasks on each processor)

Related Work
More than two decades of work on SDFG mapping.
Heuristic approaches: e.g. Periodic Admissible Sequential Schedule (PASS) [Lee 87] for single-core; e.g. the SDF-3 tool [Geilen, Stuijk 06] for multi-core
Our work: a complete approach based on Constraint Programming, for multi-core

Constraint Programming
Constraint programming is a problem-solving methodology for hard combinatorial problems.
Model:
Variables, each with a finite domain (the set of values that the variable can assume)
Constraints, each with a filtering algorithm that performs domain reduction

CP: Variables + Constraints + Search
Solving:
Consistency and constraint propagation: reduction of the domains of the variables, so that search does not explore infeasible assignments
Search: define or choose a search algorithm; define or choose heuristics
Once a problem is modeled using constraints, a wide selection of solution techniques is available

Model - Variables
Idea: model the effects of decisions by means of modifications to the graph
Allocation variables: P_i ∈ [0..#processor-1]
Scheduling variables: Next_i ∈ [-1, 0..#task-1]
Graph variables: Arcs_ij ∈ [0,1]

Model - Constraints
Idea: model the effects of decisions by means of modifications to the graph
Edge constraints: P[]
Order constraints: Next[]
Deterministic behaviour
[Figure: HSDFG with tasks A, B on Proc 1 and tasks C, D on Proc 2, with the order of the tasks on each processor]
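
One way to picture "decisions as modifications to the graph" is to turn each ordering decision into an extra arc. The sketch below is a loose illustration under stated assumptions, not the constraint model of the talk: it treats Next[i] = -1 as "task i has no successor on its processor" (an interpretation, not stated on the slide), uses a plain 0/1 adjacency matrix for the arcs, and the function name and example data are hypothetical.

```python
def arcs_from_decisions(base_arcs, P, Next):
    """Return a copy of the 0/1 arc matrix with one arc added per ordering decision.

    base_arcs[i][j] == 1 means the application graph has an arc i -> j.
    P[i] is the processor of task i; Next[i] is the task scheduled right
    after i on the same processor, or -1 (assumed: no successor).
    """
    arcs = [row[:] for row in base_arcs]
    for i, j in enumerate(Next):
        if j == -1:
            continue
        if P[i] != P[j]:
            raise ValueError(f"tasks {i} and {j} are ordered but not co-located")
        arcs[i][j] = 1  # ordering decision i before j becomes an arc
    return arcs

# Hypothetical instance: tasks 0..3 stand for A, B, C, D.
base = [
    [0, 0, 1, 0],  # A -> C
    [0, 0, 0, 1],  # B -> D
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
P = [0, 0, 1, 1]         # A, B on processor 0; C, D on processor 1
Next = [1, -1, 3, -1]    # A before B, C before D
extended = arcs_from_decisions(base, P, Next)
print(extended)
```

Once the ordering arcs are in the graph, a throughput computation on the extended graph accounts for both data dependencies and processor sharing.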

Solver
Tree-search strategy in 2 phases:
1) Task allocation: P[]
2) Ordering of tasks on processors: Next[]
Throughput Constraint
Static Symmetry Breaking Constraint

Throughput Constraint
Thr_cst(P, Next, Arcs, Thr_Bound)
W: execution time
ψ: number of tokens
Execution times: [1,1,1,1]
[Figure: example HSDFG with tasks A, B, C, D and a table indexed by level k and task v, filled in level by level]

Throughput Constraint
References: R. M. Karp [Karp66]; A. Dasdan, R. K. Gupta [Das98]
Improvements:
1) Take tokens into account. The original algorithm was devised to count one token for each arc.
2) Generalized to graphs that are not strongly connected. Most throughput computation algorithms target strongly connected graphs.
3) Cycle identification at each step, instead of at the end of the algorithm.
[Figure: table indexed by level k and task v for tasks A, B, C, D]
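
For reference, the level-by-level table appears to follow the layout of Karp's cycle mean algorithm, where D[k][v] is the best weight of a walk with exactly k arcs from a source to v. The sketch below implements only the textbook version for the maximum cycle mean on a strongly connected graph, which corresponds to counting one token per arc. It does not include the three improvements listed on the slide. The function name and the example graph are assumptions.

```python
from fractions import Fraction

NEG_INF = float("-inf")

def max_cycle_mean(n, arcs, source=0):
    """Karp-style maximum cycle mean of a strongly connected digraph.

    arcs: list of (u, v, weight) with integer weights and nodes 0..n-1.
    D[k][v] is the maximum weight of a walk with exactly k arcs from
    source to v (NEG_INF if no such walk exists).
    """
    D = [[NEG_INF] * n for _ in range(n + 1)]
    D[0][source] = 0
    for k in range(1, n + 1):
        for u, v, w in arcs:
            if D[k - 1][u] != NEG_INF:
                D[k][v] = max(D[k][v], D[k - 1][u] + w)
    best = None
    for v in range(n):
        if D[n][v] == NEG_INF:
            continue
        # Karp's formula, mirrored for the maximum: max over v of min over k.
        candidate = min(
            Fraction(D[n][v] - D[k][v], n - k)
            for k in range(n)
            if D[k][v] != NEG_INF
        )
        best = candidate if best is None else max(best, candidate)
    return best

# Hypothetical HSDFG: nodes 0, 1, 2 with execution times 1, 2, 3.
# Each arc is weighted with the execution time of its source node.
arcs = [(0, 1, 1), (1, 0, 2), (1, 2, 2), (2, 1, 3)]
mean = max_cycle_mean(3, arcs)
print(mean, 1 / mean)
```

With one token per arc, the cycle mean equals execution time over tokens, so the throughput is its reciprocal.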

Throughput Constraint
[Figure: throughput value versus algorithm step. Labels: Longer Cycles, Throughput Value Decreasing, Throughput Lower Bound, Fast Pruning]

Solver Search
[Figure: throughput value versus search step. Labels: Longer Cycles, Throughput Value Decreasing, Throughput Upper Bound, Throughput Lower Bound, Throughput Solution, Fast Pruning]


Further Optimizations
Caching System: provides some incrementality
Trivial Bound: total execution time on a processor
Processor Pruning: exclude from the computation processors with no in-arcs/out-arcs
[Figure: processors PI, PII, PIII, PIV]
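
The trivial bound can be read as follows: tasks allocated to the same processor run one after another, so one iteration of the graph cannot complete faster than the largest total execution time on any processor. The sketch below encodes that reading. It is an interpretation of the slide rather than the authors' code, and the function names, the use of `None` for unallocated tasks, and the example values are assumptions.

```python
from fractions import Fraction

def trivial_throughput_bound(exec_time, P):
    """Upper bound on throughput: 1 / (largest total execution time on a processor).

    P[i] is the processor of task i, or None if task i is not allocated yet.
    Returns None when no task is allocated, since no bound can be derived.
    """
    load = {}
    for task, proc in enumerate(P):
        if proc is not None:
            load[proc] = load.get(proc, 0) + exec_time[task]
    if not load:
        return None
    return Fraction(1, max(load.values()))

def can_prune(exec_time, P, thr_bound):
    """True if the (partial) allocation already rules out the required throughput."""
    bound = trivial_throughput_bound(exec_time, P)
    return bound is not None and bound < thr_bound

exec_time = [1, 2, 3, 4]
full = trivial_throughput_bound(exec_time, [0, 0, 1, 1])      # loads 3 and 7
partial = trivial_throughput_bound(exec_time, [0, None, 1, 1])
print(full, partial)
```

Because the check only sums execution times, it is much cheaper than a cycle-based throughput computation and can be evaluated before it.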

Tests
Implemented on ILOG Solver 6.3
Instance structure: Cyclic Graph, Acyclic Graph, Strongly Connected Graph
Architectures: 2 Core, 4 Core, 8 Core

Tests: Optimization Improvement
[Figure: run time, algorithm not optimized vs. algorithm optimized]

Tests: Optimal Solution
Columns: # Processors, # nodes, then Search Time and Constr. Time for each of the Cyclic, Acyclic and Str. Connected instances
0 0,02 0,0 0,04 0,02 0,02 0,0 2 5,88,33 2,66 0,85 0,92 0,53 20 308,5 77,89 46,89 08,06 44,5 82,3 0 0,0 0 0,04 0,02 0 0 4 5 2,4,65,85,35 0,06 0,04 20 0,2989 0,863 43,39 293,38 0,89 0,66 0 0,0 0 0,02 0,0 0,0 0 8 5 0,02 0,0,37,09 0,02 0 20 0,27 0,9 207,2 69,24 0,06 0,02

Conclusion
CP-based method for allocating and scheduling HSDFGs on multiprocessor platforms.
It can be used to find the optimal solution or to find a feasible solution.
Future & Present Work
Search directly on the SDF graph.
Take into account communication latency and memory capacity constraints.