A DoE/RSM-based Strategy for an Efficient Design Space Exploration targeted to CMPs

Gianluca Palermo, Cristina Silvano, Vittorio Zaccaria
Politecnico di Milano - Dipartimento di Elettronica e Informazione
E-mail: {gpalermo, silvano, zaccaria}@elet.polimi.it

Abstract. Application-specific MPSoCs are usually designed by using a platform-based approach, where a wide range of customizable parameters must be tuned to find the best trade-offs in terms of the selected figures of merit (such as energy, delay and area). This optimization phase is called Design Space Exploration (DSE), and it generally consists of a Multi-Objective Optimization (MOO) problem with multiple constraints. In this paper, an efficient DSE methodology for application-specific MPSoCs is presented. The methodology combines Design of Experiments (DoE) and Response Surface Modeling (RSM) strategies, and it is efficient because it determines a suitable set of candidate architectures with as few system simulations as possible.¹

1 Introduction

Customizable MPSoCs supported by parallel programming represent an emerging computing paradigm for application-specific processors. In fact, they offer the best compromise: a stable hardware platform that is software-programmable and thus customizable, upgradeable and extensible. In this sense, the MPSoC paradigm minimizes the risk of missing the time-to-market deadline, while allowing for greater efficiency thanks to architecture customization and software compilation techniques. For these architectures, the platform-based design approach [1] is widely used to design application-specific architectures that meet time-to-market constraints. In this scenario, configurable simulation models are used to accurately tune the on-chip architectures and to meet the target application requirements in terms of performance, battery lifetime and area.
The Design Space Exploration (DSE) phase is used to tune the configurable system parameters, and it generally consists of a multi-objective optimization problem. The DSE problem requires exploring a large design space spanned by several parameters at the system and micro-architectural levels. Several heuristic techniques have been proposed to address this problem, but they are all characterized by a low efficiency in identifying the Pareto front of feasible solutions. Evolutionary and sensitivity-based algorithms are among the most notable state-of-the-art techniques [2-4].

¹ This work was supported in part by the EC under grant MULTICUBE FP7-216693.

2 An application-specific DSE methodology

In this paper, we present an application-specific design space exploration strategy leveraging Design of Experiments (DoE) and Response Surface Modeling (RSM) techniques. Once the objective functions associated with the system have been identified, the proposed methodology efficiently identifies an approximate Pareto set of candidate architectures by evaluating as few system configurations as possible. This is an important achievement because, for complex SoCs, evaluating an objective function f(x) (such as performance or power consumption) for a single system configuration x can require hours or days of simulation under a realistic workload.

DESIGN OF EXPERIMENTS. The term Design of Experiments (DoE) [5] denotes the planning of an information-gathering experimentation campaign in which a set of variable parameters can be tuned. In this paper, we define an experiment as an actual simulation of the target system. DoEs are needed because the designer is very often interested in the effect that tuning some parameters has on the system response. Design of experiments is a discipline with very broad application across the natural and social sciences; it encompasses a set of techniques whose main goal is to screen and analyze the system behavior with a small number of simulations. DoE plans differ in the layout of the selected design points in the design space. Although many DoE plans have been proposed in the literature, here we use the following traditional ones as building blocks of our efficient design space exploration methodology:

Random. Design space configurations are picked randomly according to a Probability Density Function (PDF). In our methodology, we use a uniform PDF.

Full factorial. In statistics, a factorial experiment [5] is an experiment whose design consists of two or more parameters, each with discrete possible values (levels), and whose experimental units take on all possible combinations of these levels across all such parameters.
Such an experiment allows studying the effect of each parameter on the response variable, as well as the effect of interactions between parameters on the response variable. In this paper, we consider a 2-level full factorial DoE, where the only levels considered are the minimum and the maximum of each parameter.

Central composite design. A Central Composite Design [5] is an experimental design specifically targeted to the construction of second-order (quadratic) response surfaces without requiring a three-level factorial DoE.

Box-Behnken. The Box-Behnken design [5] is suited to quadratic models: its parameter combinations lie at the midpoints of the edges of the process space, plus one center point with all the parameters at their central value. Its main advantage is that no combination sets all the parameters to extreme values at the same time (in contrast with the central composite design).

RESPONSE SURFACE METHODS. Response Surface Modeling techniques determine an analytical dependence between several design parameters and one or more response variables. The working principle of RSM is to use a set of simulations generated by a DoE to obtain a response model. A typical RSM flow involves a training phase, in which known data (the training set) are used to identify the RSM configuration, and a prediction phase, in which the RSM is used to forecast the unknown system response. RSMs are an effective tool for analytically predicting the behavior of the system platform without resorting to system simulation; they represent the core of the presented methodology. The RSM models used in the presented methodology are the following:

Linear regression. Linear regression models a linear relationship between a dependent response function f and some independent variables x_i, i = 1, ..., p, plus a random error term ε. In this work, the regression also takes into account the interactions between parameters, as well as the quadratic behavior with respect to each single parameter.

Shepard's interpolation. Shepard's technique is a well-known method for multivariate interpolation. It is also called the inverse distance weighting (IDW) method because the value of the response function at an unknown point is computed as the average of its values at the known points, weighted by the inverse of the distance (with the weights normalized to sum to one).

Artificial Neural Networks. Artificial neural networks (ANNs) [6] represent a powerful and flexible method for generalized non-linear regression. The ANN approximation function f is defined recursively as a function of a linear combination of other functions f_i:

$$f(x) = \Theta\left(\sum_i w_i f_i(x)\right) \qquad (1)$$

The function Θ is called the activation function, while the coefficients w_i are called weights. Each function f_i can in turn be defined as in Equation (1), which creates a layered structure.

Radial Basis Functions. Radial basis functions (RBFs) are a widely used interpolation/approximation model [7]. The interpolation function is built on a set of training configurations x_j as follows:

$$f(x) = \sum_{j=1}^{n} \lambda_j \, \phi\left(\lVert x - x_j \rVert\right) \qquad (2)$$

where φ is a scalar function of the distance, λ_j are the weights of the RBF, and n is the number of samples in the training set.

THE PROPOSED DESIGN FLOW. The proposed strategy is called Response Surface-based Pareto Iterative Refinement (ReSPIR). It iteratively refines an approximate Pareto set by using the predictions given by an RSM model. The methodology is parametric with respect to the DoE and RSM techniques, as well as to the maximum number of simulations to be run (see Algorithm 1).
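The flow of Algorithm 1 can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the authors' implementation: all objectives are minimized; feasibility constraints are omitted, so Ψ reduces to a plain Pareto filter; the RSM is Shepard's interpolation with inverse-distance power 2; and the coverage χ, which the text does not define, is assumed here to be the fraction of newly simulated points in F_1 that are not dominated by the archive F_0. All function names and the toy simulator are hypothetical.

```python
import itertools
import math
import random


def random_doe(space, k, rng):
    """Random DoE: k configurations drawn with a uniform PDF (no repeats)."""
    return rng.sample(space, k)


def full_factorial_2level(levels):
    """2-level full factorial DoE: all min/max combinations of the parameters."""
    return list(itertools.product(*[(min(v), max(v)) for v in levels]))


def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_filter(archive):
    """Psi: keep the non-dominated entries of a {config: objectives} archive.
    Feasibility constraints are omitted in this sketch."""
    return {c: f for c, f in archive.items()
            if not any(dominates(g, f) for g in archive.values())}


def shepard_predict(train, x, power=2.0):
    """Shepard's IDW: inverse-distance-weighted average of the known responses."""
    num, den = None, 0.0
    for xk, fk in train.items():
        d = math.dist(xk, x)
        if d == 0.0:
            return fk  # exact at already-simulated configurations
        w = d ** -power
        den += w
        wf = [w * v for v in fk]
        num = wf if num is None else [n + v for n, v in zip(num, wf)]
    return tuple(n / den for n in num)


def coverage(f1, f0):
    """Hypothetical chi: fraction of the new points in F1 not dominated by F0."""
    if not f1:
        return 0.0
    fresh = [f for f in f1.values()
             if not any(dominates(g, f) for g in f0.values())]
    return len(fresh) / len(f1)


def respir(space, simulate, initial, max_nsim, rsm=shepard_predict):
    f0 = {x: simulate(x) for x in initial}           # step 2: run the DoE
    nsim = len(f0)
    cov, f1 = 1.0, {}                                # steps 3-4
    while cov > 0 and nsim < max_nsim:               # step 5
        f0.update(f1)                                # step 6: F0 = F0 U F1
        r0 = {x: rsm(f0, x) for x in space}          # step 7: predict all of X
        r1 = pareto_filter(r0)                       # step 8: R1 = Psi(R0)
        # Step 9: simulate only configurations not yet simulated; capping the
        # batch at the remaining budget is a choice made in this sketch.
        todo = [x for x in r1 if x not in f0][:max_nsim - nsim]
        f1 = {x: simulate(x) for x in todo}
        nsim += len(f1)
        cov = coverage(f1, f0)                       # step 10
    f0.update(f1)  # keep the last batch; the result is unchanged when cov == 0
    return pareto_filter(f0), nsim                   # step 12


# Toy usage: a hypothetical 4x4 design space and a cheap stand-in "simulator".
space = list(itertools.product(range(1, 5), range(1, 5)))


def simulate(x):
    return (float(x[0] + x[1]), 12.0 / x[0] + 12.0 / x[1])  # (area-like, delay-like)


front, nsim = respir(space, simulate, full_factorial_2level([range(1, 5)] * 2), max_nsim=10)
front_r, nsim_r = respir(space, simulate, random_doe(space, 4, random.Random(0)), max_nsim=10)
print(nsim, sorted(front))
```

Passing a different `initial` list or `rsm` function changes the DoE or the RSM without touching the loop, mirroring the parametric nature of the methodology. Skipping configurations already in F_0 means that an iteration with no new candidates ends the loop with zero coverage.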
Initially (step 2), the DoE plan is used to select the set of initial configurations, i.e., the plan of simulations to be run. This step provides an initial coarse view of the target design space: the simulations are run to obtain the actual measurements of f for the configurations in F_0. In the subsequent steps, F_0 is the archive containing all the architectural configurations simulated so far, together with their measurements. At the first iteration, provided that maxnsim is greater than the DoE size, the condition in step 5 is met (cov is initialized to 100% in step 3) and the body of the while loop is entered. The RSM technique is then trained with the current archive F_0 (step 7) and generates a prediction archive R_0, which is filtered for Pareto configurations in step 8. Then (step 9), the simulations associated with the predicted Pareto set R_1 are run, the results are put into the intermediate archive F_1, and the coverage of F_1 with respect to F_0 is computed (step 10). The algorithm iterates until either the coverage reaches 0 or the number of simulations reaches the maximum maxnsim. When the loop is iterated again, the freshly simulated Pareto points in F_1 are merged into F_0 (step 6) to improve the prediction accuracy of the RSM. Finally (step 12), F_0 is Pareto filtered, pruning all the non-feasible configurations; the resulting archive is the approximate solution to our Design Space Exploration problem.

Algorithm 1. The RSM-Supported Iterative Pareto Refinement Design Space Exploration Flow
Require: DOE, RSM, maxnsim
 1: nsim = 0
 2: Generate and run the simulations from DOE. Update nsim accordingly. Put results into F_0.
 3: cov = 100%
 4: F_1 = {}
 5: while (cov > 0) ∧ (nsim < maxnsim) do
 6:   F_0 = F_0 ∪ F_1
 7:   Train RSM with the content of F_0 and compute a prediction R_0 for all x ∈ X
 8:   R_1 = Ψ(R_0)
 9:   Generate and run the simulations associated with the configurations in R_1. Update nsim accordingly. Put results into F_1.
10:   cov = χ(F_1, F_0)
11: end while
12: return Ψ(F_0) (Pareto filtering, pruning non-feasible configurations)

3 Validation of ReSPIR

To validate the presented ReSPIR methodology, we applied it to the customization of a symmetric shared-memory multiprocessor architecture running a set of standard benchmarks derived from the SPLASH-2 suite. We focused our analysis on the architectural parameters listed in Table 1, which define a design space X of 2^17 alternative configurations. To evaluate the system metrics (execution time and energy consumption), we leveraged the Sesc [8] simulation tool, a fast simulator for chip-multiprocessor architectures that provides energy and performance results for a given application. Within Sesc, energy consumption is computed with the CACTI [9] and WATTCH [10] models.
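Table 1 lists only the minimum and maximum of each parameter. Assuming (this is not stated in the text) that every parameter takes only power-of-two values within its range, the following sketch checks that the number of alternative configurations matches |X| = 2^17, and computes the 3.5% simulation budget used in the comparison below. The dictionary keys are hypothetical names.

```python
import math

# Table 1 ranges as (min, max); cache and block sizes in bytes.
# Power-of-two levels inside each range are an assumption of this sketch.
TABLE_1 = {
    "processors": (2, 16),
    "issue_width": (1, 8),
    "l1i_size": (2 * 1024, 16 * 1024),
    "l1d_size": (2 * 1024, 16 * 1024),
    "l2_size": (32 * 1024, 256 * 1024),
    "l1i_assoc": (1, 8),
    "l1d_assoc": (1, 8),
    "l2_assoc": (1, 8),
    "block_size": (16, 32),
}


def pow2_levels(lo, hi):
    """All powers of two between lo and hi, inclusive (lo, hi powers of two)."""
    return [1 << e for e in range(lo.bit_length() - 1, hi.bit_length())]


levels = {name: pow2_levels(lo, hi) for name, (lo, hi) in TABLE_1.items()}
size = math.prod(len(v) for v in levels.values())  # 4^8 * 2
budget = math.floor(0.035 * size)  # upper bound on simulations (3.5% of X)
print(size, size == 2 ** 17, budget)  # -> 131072 True 4587
```

Under this assumption, eight parameters contribute four levels each and the block size contributes two, so 4^8 × 2 = 2^17.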
To compare ReSPIR fairly with other state-of-the-art heuristics, we introduce a Multi-Objective Simulated Annealing (MOSA) derived from [11] and a Multi-Objective Genetic Algorithm (NSGA-II) derived from [12]. Each of these heuristics is parametrized in terms of variables such as the initial population, the number of iteration steps, the set of permutation probabilities and other algorithm-specific parameters. Calibrating these parameters is generally a very difficult task that depends strongly on the problem domain; moreover, since the heuristics are inherently random², each combination of heuristic variables should be evaluated more than once to infer a more general trend. The number of runs for each algorithm is set such that the actual performance of the algorithm (in terms of ADRS, Average Distance from Reference Set) reaches an average asymptotic value. In practice, this resulted in more than a hundred evaluations for each heuristic.

² This is also true for ReSPIR whenever we use a random DoE or a neural network whose initial weights are chosen randomly.

Table 1. Design space for the shared-memory multi-processor platform

Parameter                       Min.     Max.
# Processors                    2        16
Processor issue width           1        8
L1 instruction cache size       2 KB     16 KB
L1 data cache size              2 KB     16 KB
L2 private cache size           32 KB    256 KB
L1 instruction cache assoc.     1-way    8-way
L1 data cache assoc.            1-way    8-way
L2 private cache assoc.         1-way    8-way
I/D/L2 block size (bytes)       16       32

We underline that our goal is to obtain a good approximation of the exact Pareto set (ADRS around 1%) by executing as few simulations as possible, i.e., by simulating less than 3.5% of the entire design space. Consequently, each strategy has been run with an upper bound on the number of simulations equal to 3.5% of the entire design space, and the resulting Pareto front has been validated against the reference, exact Pareto front of the target architecture³. For ReSPIR, we focus on the overall performance by presenting a collapsed view over all the combinations of DoE and RSM, without breaking out the algorithm performance for each DoE and RSM. This is because we want to demonstrate the quality of the ReSPIR exploration strategy itself, not that of a particular DoE or RSM. Figures 1(a), 1(b) and 1(c) show the average ADRS of the approximate Pareto fronts with respect to the exact Pareto front as the percentage of the design space analyzed varies. The figures also show the estimated ADRS standard deviation. MOSA is the worst heuristic in terms of average ADRS, starting from 18% when 1% of the design space is analyzed and decreasing to 10% at 2.5% of the design space. For the same percentage of the design space, NSGA-II and ReSPIR reach 5% and 2.5%, respectively.
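ADRS is used above but not defined in this section. The sketch below assumes the definition commonly used for this metric: for every point of the reference (exact) Pareto front, take the distance to the closest point of the approximate front, where the distance is the worst relative degradation over all objectives (minimized and strictly positive), and average these values over the reference front. The function names and the example fronts are made up for illustration.

```python
def delta(ref, approx):
    """Relative distance from a reference point to an approximate point:
    the worst relative degradation over all objectives (0 if approx is no worse).
    Assumes all objectives are minimized and strictly positive."""
    return max(max(0.0, (a - r) / r) for r, a in zip(ref, approx))


def adrs(reference, approximate):
    """Average Distance from Reference Set: mean, over the reference Pareto
    points, of the distance to the closest point of the approximate front."""
    return sum(min(delta(r, a) for a in approximate) for r in reference) / len(reference)


# Made-up two-objective fronts (e.g. normalized energy and delay).
exact = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]
approx = [(1.0, 4.0), (2.2, 2.0)]
print(adrs(exact, approx))  # about 0.367, i.e. an ADRS of roughly 36.7%
```

An ADRS of 0 means that every exact Pareto point is matched by the approximation; the 2.5% reported for ReSPIR corresponds to a value of 0.025 under this definition.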
MOSA also lags behind in terms of standard deviation, which reaches 4% at the upper bound of the design space analyzed, whereas NSGA-II and ReSPIR obtain around 0.5%.

4 Conclusions

In this paper, we presented ReSPIR, a design space exploration methodology that leverages the traditional DoE paradigm and RSM techniques, combined with a powerful way of considering customized application constraints. The design of experiments phase generates an initial plan of experiments, which is used to create a coarse view of the target design space; then a set of response surface extraction techniques is used to identify non-feasible configurations and refine the Pareto configurations. This process is repeated iteratively until a target criterion (e.g., the number of simulations) is satisfied.

³ The reference Pareto front has been computed with a full-search algorithm; thus, it is the exact Pareto front.

[Figure: three panels, (a) MOSA, (b) NSGA-II, (c) ReSPIR; y-axis: ADRS; x-axis: percentage of design space analyzed (0.01 to 0.04); curves: mean(adrs) and std(adrs).]
Fig. 1. Average ADRS (with standard deviation) versus the percentage of the design space analyzed by (a) MOSA, (b) NSGA-II and (c) ReSPIR.

References

1. K. Keutzer, S. Malik, A. R. Newton, J. Rabaey, and A. Sangiovanni-Vincentelli. System level design: Orthogonalization of concerns and platform-based design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(12):1523-1543, December 2000.
2. G. Palermo, C. Silvano, and V. Zaccaria. Multi-objective design space exploration of embedded system. Journal of Embedded Computing, 1(3):305-316, 2006.
3. G. Ascia, V. Catania, A. G. Di Nuovo, M. Palesi, and D. Patti. Efficient design space exploration for application specific systems-on-a-chip. Journal of Systems Architecture, 53(10):733-750, 2007.
4. G. Beltrame, D. Bruschi, D. Sciuto, and C. Silvano. Decision-theoretic exploration of multiprocessor platforms. In Proceedings of CODES+ISSS: International Conference on Hardware-Software Codesign and System Synthesis, pages 205-210, 2006.
5. T. J. Santner, B. Williams, and W. Notz. The Design and Analysis of Computer Experiments. Springer-Verlag, 2003.
6. C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 2002.
7. M. J. D. Powell. The theory of radial basis functions. In W. Light (ed.), Advances in Numerical Analysis II: Wavelets, Subdivision, and Radial Basis Functions, pages 105-210. University Press, 1992.
8. J. Renau, B. Fraguela, J. Tuck, W. Liu, M. Prvulovic, L. Ceze, S. Sarangi, P. Sack, K. Strauss, and P. Montesinos. SESC simulator, January 2005. http://sesc.sourceforge.net.
9. S. Wilton and N. Jouppi. CACTI: An Enhanced Cache Access and Cycle Time Model. Volume 31, pages 677-688, 1996.
10. D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-level power analysis and optimizations. In Proceedings of ISCA 2000: International Symposium on Computer Architecture, pages 83-94, 2000.
11. P. Czyżak and A. Jaszkiewicz. Pareto simulated annealing - a metaheuristic technique for multiple-objective combinatorial optimisation. Journal of Multi-Criteria Decision Analysis, 7:34-47, April 1998.
12. K. Deb, S. Agrawal, A. Pratap, and T. Meyarivan. A Fast and Elitist Multi-Objective Genetic Algorithm: NSGA-II. In Proceedings of the Parallel Problem Solving from Nature VI Conference, pages 849-858, 2000.