On-Chip Memory Architecture Exploration of Embedded System on Chip


A Thesis Submitted for the Degree of Doctor of Philosophy in the Faculty of Engineering

by T.S. Rajesh Kumar

Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore

September 2008


To my Family, Sree, Amma, Advika and Adarsh


Abstract

Today's feature-rich multimedia products require embedded system solutions built around complex Systems-on-Chip (SoCs) to meet market expectations of high performance at low cost and low energy consumption. SoCs are complex designs with multiple embedded processors, memory subsystems, and application-specific peripherals. The memory architecture of embedded SoCs strongly influences the area, power and performance of the entire system. Further, the memory subsystem constitutes a major part (typically up to 70%) of the silicon area of current SoCs.

The on-chip memory organization of embedded processors varies widely from one SoC to another, depending on the application and market segment for which the SoC is deployed. There is a wide variety of choices available to embedded designers, ranging from a simple on-chip SPRAM-based architecture to a more complex cache-SPRAM based hybrid architecture. The performance of a memory architecture also depends on how the data variables of the application are placed in the memory. There are multiple data layouts for each memory architecture that are efficient from a power and performance viewpoint. Further, the designer would be interested in multiple optimal design points to address various market segments. Hence memory architecture exploration for an embedded system involves evaluating a large design space, on the order of 100,000 design points, with each design point having several tens of thousands of data layouts. Due to its large impact on system performance parameters, the memory architecture is often hand-crafted by experienced designers who explore only a very small subset of this design space. The vastness of the memory design space rules out a thorough manual analysis.

In this work, we propose an automated framework for on-chip memory architecture exploration. Our proposed framework integrates memory architecture exploration and data layout to search the design space efficiently. While the memory exploration selects specific memory architectures, the data layout efficiently maps the given application onto the memory architecture under consideration and thus helps in evaluating the memory architecture. The proposed memory exploration framework works at both the logical and the physical memory architecture level. Our work addresses on-chip memory architectures for DSP processors that are organized as multiple memory banks, where each bank can be single- or dual-ported and bank sizes can be non-uniform. Further, our work also addresses memory architecture exploration for on-chip memory architectures that are SPRAM and cache based. Our proposed method is based on a multi-objective Genetic Algorithm and outputs several hundred Pareto-optimal design solutions that are interesting from an area, power and performance viewpoint within a few hours of running on a standard desktop configuration.

Acknowledgments

There are many people I would like to thank who have helped me in various ways. First and foremost, I would like to thank my supervisors, Prof. R. Govindarajan and Dr. C.P. Ravikumar, who have guided and supported me in various aspects throughout the journey to the completion of my thesis work. I thank them profusely for the encouragement they provided and for their perseverance in keeping me focused on the Ph.D. work. I would like to express my gratitude to Texas Instruments for giving me the time and opportunity to pursue my studies. I would like to thank my colleagues at Texas Instruments for their support and reviews, in particular my manager Balaji Holur. I would also like to thank my previous managers Pamela Kumar and Manohar Sambandam. Last but not least, I would like to thank my dearest family members for the encouragement they provided and the sacrifices they made to help me achieve my goals.


Contents

Abstract
Acknowledgments
List of Publications from this Thesis

1 Introduction: Application Specific Systems; Memory Subsystem; On-chip Memory Organization; Cache-based Memory Organization; Scratch Pad Memory-based Organization; Data Layout; Memory Architecture Exploration; Embedded System Design Flow; Contributions; Thesis Overview

2 Background: On-chip Memory Architecture of Embedded Processors; DSP On-chip SPRAM Architecture; Microcontroller Memory Architecture; Software Optimizations; DSP Software Optimizations; MCU Software Optimizations; Cache Based Embedded SOC; Cache-SPRAM Based Hybrid On-chip Memory Architecture; Genetic Algorithms - An Overview; Multi-objective Multiple Design Points

3 Data Layout for Embedded Applications: Introduction; Method Overview and Problem Statement; Method Overview; Problem Statement; ILP Formulation; Basic Formulation; Handling Multiple Memory Banks; Handling SARAM and DARAM; Overlay of Data Sections; Swapping of Data; Genetic Algorithm Formulation; Heuristic Algorithm; Data Partitioning into Internal and External Memory; DARAM and SARAM placements; Experimental Methodology and Results; Experimental Methodology; Integer Linear Programming - Results; Heuristic and GA Results; Comparison of Heuristic Data Layout with GA; Comparison of Different Approaches; Related Work; Conclusions

4 Logical Memory Exploration: Introduction; Method Overview; Memory Architecture Parameters; Memory Architecture Exploration Objectives; Memory Architecture Exploration and Data Layout; Genetic Algorithm Formulation; GA Formulation for Memory Architecture Exploration; Pareto Optimality and Non-Dominated Sorting; Simulated Annealing Formulation; Memory Subsystem Optimization; Experimental Results; Experimental Methodology; Experimental Results; Related Work; Conclusions

5 Data Layout Exploration: Introduction; Problem Definition; MODLEX: Multi Objective Data Layout EXploration; Method Overview; Mapping Logical Memory to Physical Memory; Genetic Algorithm Formulation; Experimental Results; Experimental Methodology; Experimental Results; Comparison of MODLEX and Stand-alone Optimizations; Related Work; Conclusions

6 Physical Memory Exploration: Introduction; Logical Memory Exploration to Physical Memory Exploration (LME2PME); Method Overview; Physical Memory Exploration; Genetic Algorithm Formulation; Direct Physical Memory Exploration (DirPME) Framework; Method Overview; Genetic Algorithm Formulation; Experimental Methodology and Results; Experimental Methodology; Experimental Results from LME2PME; Experimental Results from DirPME; Comparison of LME2PME and DirPME; Related Work; Conclusions

7 Cache Based Architectures: Introduction; Solution Overview; Data Partitioning Heuristic; Cache Conscious Data Layout; Overview; Graph Partitioning Formulation; Cache Offset Computation; Experimental Methodology and Results; Experimental Methodology; Cache-Conscious Data Layout; Cache-SPRAM Data Partitioning; Memory Architecture Exploration; 7.6 Related Work; Cache Conscious Data Layout; SPRAM-Cache Data Partitioning; Memory Architecture Exploration; Conclusions

8 Conclusions: Thesis Summary; Future Work; Standardization of Input and Output Parameters; Impact of platform change on system performance; Impact of Application IP library rework on system performance; Impact of semiconductor library rework on the system performance; Multiprocessor Architectures

Bibliography

List of Tables

1.1 Explanation of X-chart Steps
List of Symbols Used
Memory Architecture for the Experiments
Experimental Results
Results from Heuristic Placement (HP) and Genetic Placement (GP) on 4 Embedded Applications (VE = Voice Encoder, JP = JPEG Decoder, LLP = Levinson's Linear Predictor, 2D = 2D Wavelet Transform)
Comparative Ranking of Algorithms
Memory Architecture Parameters
Evaluation of Multi-Objective Cost Function
Memory Architecture Exploration
Non-dominant Points Comparison GA-SA
Memory Architectures Used for Data Layout
Memory Architectures Explored - Using DirPME Approach
Non-dominant Points Comparison LME2PME-DirPME
Input Parameters for Data Partitioning Algorithm
Data Layout Comparison
Data Layout for Different Cache Configurations

List of Figures

1.1 Architecture of an Embedded SoC
Embedded Application Development Flow
Memory Trends in SoC
Application Specific SoC Design Flow Illustration with X-chart
Mapping Chapters to X-chart Steps
Example DSP Memory Map
Cache-SPRAM Based On-Chip Memory Architecture
Genetic Algorithm Flow
Overview of Data Layout
Illustration of Parallel and Self Conflicts
Heuristic Algorithm for Data Layout
Relative performance of the Genetic Algorithm w.r.t. Heuristic, for Varying Number of Generations
Comparison of Heuristic Data Layout Performance with GA Data layout
DSP Processor Memory Architecture
Two-stage Approach to Memory Subsystem Optimization
Comparison of GA and SA Approaches for Memory Exploration
Vocoder Non-dominated Points Comparison Between GA and SA
Vocoder: Memory Exploration (All Design Points Explored and Non-dominated Points)
4.6 MPEG: Memory Exploration (All Design Points Explored and Non-dominated Points)
JPEG: Memory Exploration (All Design Points Explored and Non-dominated Points)
DSL: Memory Exploration (All Design Points Explored and Non-dominated Points)
MODLEX: Multi Objective Data Layout EXploration Framework
Data Layout Exploration: MPEG Encoder
Data Layout Exploration: Voice Encoder
Data Layout Exploration: Multi-Channel DSL
Individual Optimizations vs Integrated Memory Architecture Exploration
Memory Architecture Exploration - Integrated Approach
Logical to Physical Memory Exploration - Overview
Logical to Physical Memory Exploration - Method
GA Formulation of LME2PME
MAX: Memory Architecture exploration Framework
GA Formulation of Physical Memory Exploration
Voice Encoder: Memory Architecture Exploration - Using LME2PME Approach
MPEG: Memory Architecture Exploration - Using LME2PME Approach
DSL: Memory Architecture Exploration - Using LME2PME Approach
Voice Encoder (3D view): Memory Architecture Exploration - Using DirPME Approach
Voice Encoder: Memory Architecture Exploration - Using DirPME Approach
MPEG Encoder: Memory Architecture Exploration - Using DirPME Approach
DSL: Memory Architecture Exploration - Using DirPME Approach
7.1 Target Memory Architecture
Memory Exploration Framework
Example: Temporal Relationship Graph
Heuristic Algorithm for Data Partitioning
Cache Conscious Data Layout
Heuristic Algorithm for Offset Computation
AAC: Performance for different Hybrid Memory Architecture
MPEG: Performance for different Hybrid Memory Architecture
JPEG: Performance for different Hybrid Memory Architecture
AAC: Power consumed for different hybrid memory architecture
MPEG: Power consumed for different hybrid memory architecture
JPEG: Power consumed for different hybrid memory architecture
AAC: Non-dominated Solutions
MPEG: Non-dominated Solutions
JPEG: Non-dominated Solutions


List of Publications from this Thesis

1. T.S. Rajesh Kumar, R. Govindarajan, and C.P. Ravikumar. On-chip Memory Architecture Exploration Framework for DSP Processor Based Embedded SoC. Submitted to the ACM Transactions on Embedded Computing Systems, May.
2. T.S. Rajesh Kumar, R. Govindarajan, and C.P. Ravikumar. Memory Architecture Exploration Framework for Cache-based Embedded SoC. In Proceedings of the International Conference on VLSI Design, Jan.
3. T.S. Rajesh Kumar, R. Govindarajan, and C.P. Ravikumar. MODLEX: A Multi-Objective Data Layout EXploration Framework for Embedded SoC. In Proceedings of the 12th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan.
4. T.S. Rajesh Kumar, R. Govindarajan, and C.P. Ravikumar. MAX: A Multi-Objective Memory Architecture Exploration Framework for Embedded SoC. In Proceedings of the International Conference on VLSI Design, Jan.
5. T.S. Rajesh Kumar, R. Govindarajan, and C.P. Ravikumar. Embedded Tutorial on Multi-Processor Architectures for Embedded SoC. In Proceedings of the VLSI Design and Test, Aug.
6. T.S. Rajesh Kumar, R. Govindarajan, and C.P. Ravikumar. Optimal Code and Data Layout for Embedded Systems. In Proceedings of the International Conference on VLSI Design, Jan.
7. T.S. Rajesh Kumar, R. Govindarajan, and C.P. Ravikumar. Memory Exploration for Embedded Systems. In Proceedings of the VLSI Design and Test, Aug 2002.


Chapter 1

Introduction

1.1 Application Specific Systems

Today's VLSI technology allows us to integrate tens of processor cores on the same chip along with embedded memories, application-specific circuits, and interconnect infrastructure. As a result, it is possible to integrate an entire system onto a single chip. The single-chip phone, which has been introduced by several semiconductor vendors, is an example of such a system-on-chip; it includes the modem, radio transceiver, power management functionality, a multimedia engine and security features, all on the same chip.

An embedded system is an application-specific system which is optimized to perform a single function or a small set of functions [70]. We distinguish this from a general-purpose system, which is software-programmable to perform multiple functions. A personal computer is an example of a general-purpose system; depending on the software we run on the computer, it can be useful for playing games, word processing, database operations, scientific computation, etc. On the other hand, a digital camera is an example of an embedded system, which can perform a limited set of functions such as taking pictures, organizing them, or transferring them to another device through a suitable I/O interface. Other examples of embedded systems include mobile phones, audio/video players, video game consoles, set-top boxes, car infotainment systems, personal digital assistants, telephone central-office switches, and dedicated network routers and bridges.

Note that a large number of embedded systems are built for the consumer market. As a result, in order to be competitive, the cost of an embedded system cannot be very high. Yet consumers demand higher performance and more features from embedded system products. It is easy to appreciate this point if we compare the performance and feature set offered by mobile phones that cost Rs 5000 (about $100) today with those that cost the same a few years ago. We also see that a large number of embedded systems are being built for the mobile market. This trend is not surprising: the number of mobile phone subscribers increased from 500 million in 2000 to 2.6 billion in 2007 [7]. Because of such high volumes, embedded systems are extremely cost sensitive and their design demands careful silicon-area optimization. Since mobile devices use batteries as the main source of power, embedded systems must also be optimized for energy dissipation. Power, which represents the rate at which energy is consumed, must also be kept low to avoid heating and to improve reliability. In summary, the designer of an embedded system must simultaneously consider and optimize price, performance, energy, and power dissipation. Application-specific embedded systems designed today demand innovative methods to optimize these system cost functions [11, 19].

Many of today's embedded systems are based on system-on-chip platforms [16], which, in turn, consist of one or more embedded microcontrollers, digital signal processors (DSPs), application-specific circuits and read-only memory, all integrated into a single package. These blocks are available from vendors of intellectual property (IP) as hard cores or soft cores [42, 28]. A hard core, or hard IP block, is one where the circuit is available at a lower level of abstraction, such as the layout level [42, 28]; it is impossible to customize a hard IP to suit the requirements of the embedded system. As a result, there are limited opportunities to optimize the cost functions by modifying the hard IP. For example, if some functionality included in the IP is not required in the present application, we cannot remove the function to save area. Soft IP refers to circuits which are available at a higher level of abstraction, such as the register-transfer level [28, 42]. It is possible to customize the soft IP for the specific application. The designer of an embedded SoC integrates the IP cores for processors, memories, and application-specific hardware to create the SoC. Figure 1.1 illustrates the architecture of an embedded system-on-chip (SoC). As can

be seen in the figure, there are four principal components in such an SoC.

1. An Analog Front End, which includes the analog/digital and digital/analog converters.
2. Programmable Components, which include microprocessors, microcontrollers, and DSPs. The number of embedded processors is increasing every year. An interesting statistic shows that of the nine billion processors manufactured in 2005, less than 2% were used for general-purpose computers; the other 8.8 billion went into embedded systems [13]. The microcontroller/microprocessor is useful in handling interrupts, housekeeping and timing-related functions. The DSP is useful for processing audio and video information, e.g., compression and decompression. The application software is normally preloaded in the memory and is not user-programmable, unlike in general-purpose processor-based systems.
3. Application-specific components, which include hardware accelerators for compute-intensive functions. Examples of hardware accelerators include digital image processors, which are useful in cameras.

Figure 1.1: Architecture of an Embedded SoC

1.2 Memory Subsystem

On-chip Memory Organization

The memory architecture of an embedded processor core is complex and is custom designed to improve run-time performance and power consumption. In this section we describe only the memory architecture of the DSP processor, as this is the focus of the thesis. This is because the memory architecture of the DSP is more complex than that of microcontrollers (MCUs), for the following reasons: (a) DSP applications are more data-dominated than the control-dominated software executed on an MCU. Memory bandwidth requirements for DSP applications range from 2 to 3 memory accesses per processor clock cycle. For an MCU, this figure is, at best, one memory access per cycle. (b) It is critical in DSP applications to extract maximum performance from the memory subsystem in order to meet the real-time constraints of the embedded application. As a consequence, the DSP software for critical kernels is developed mostly as hand-optimized assembly code. In contrast, the software for MCUs is typically developed in high-level languages. The memory architecture of a DSP is unique, since the DSP has multiple on-chip buses and multiple address generation units to service higher bandwidth needs.

The on-chip memory of embedded processors can include (a) only Level-1 cache (L1-cache) (e.g., [1]), (b) only scratch-pad RAM (SPRAM) (e.g., [75, 76]), or (c) a combination of L1-cache and SPRAM (e.g., [2, 77]).

Cache-based Memory Organization

Purely cache-based on-chip memory organization is generally not preferred by embedded system designers, as this organization cannot guarantee the worst-case execution time constraints. This is because the access time in a cache-based system can vary depending on whether the access results in a cache miss or a hit [33]. As a consequence, the run-time

performance of cache-based memory subsystems varies with the execution path of the application and is data dependent. However, a cache architecture is advantageous in that it reduces the programmer's responsibility for placing data to achieve better memory access time. Further, the movement of data from off-chip memory to cache is transparent. In [12], the authors present a comparison study of SPRAM and cache for embedded applications and conclude that SPRAM has 34% smaller area and 40% lower power consumption than a cache of the same capacity. There is published literature on estimating the worst-case execution time [81] and finding an upper bound on run-time [78] for cache-based embedded systems. Hence it was argued that for real-time embedded systems which require stringent worst-case performance guarantees, a purely cache-based on-chip organization is not suitable.

Scratch Pad Memory-based Organization

On-chip memory organization based only on scratch pad memory ensures single-cycle access times and guarantees on worst-case execution for data that resides in scratch-pad RAM (SPRAM). However, it is the responsibility of the programmer to identify the data sections that should be placed in SPRAM, or to place code in the program to move data appropriately from off-chip memory to SPRAM. A DSP core can include the following types of memory: static RAM (SRAM), ROM, and/or dynamic RAM (DRAM). The scratch pad memory in the DSP core is organized into multiple memory banks to facilitate multiple simultaneous data accesses. A memory bank can be organized as a single-access RAM (SARAM) or a dual-access RAM (DARAM) to provide single or dual access to the memory bank in a single cycle. Also, the on-chip memory banks can be of different sizes. Smaller memory banks consume less power per access than larger ones. The embedded system may also be interfaced to off-chip memory, which can include SRAM and DRAM.
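The effect of SARAM and DARAM banks on simultaneous accesses can be illustrated with a small sketch. It is a simplified model written for illustration only, not the cost model used in the thesis: the bank names, the `BANK_PORTS` table, the function `count_conflict_stalls`, and the assumption that each conflicting pair costs exactly one extra cycle are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical bank table: bank name -> accesses served per cycle
# (1 for a single-access SARAM bank, 2 for a dual-access DARAM bank).
BANK_PORTS: Dict[str, int] = {"saram0": 1, "saram1": 1, "daram0": 2}


def count_conflict_stalls(
    layout: Dict[str, str],
    parallel_accesses: List[Tuple[str, str, int]],
) -> int:
    """Count extra cycles caused by two same-cycle accesses hitting one bank.

    layout maps each data section to a bank name; parallel_accesses lists
    (section_a, section_b, count) for pairs accessed in the same cycle.
    """
    stalls = 0
    for a, b, count in parallel_accesses:
        bank_a, bank_b = layout[a], layout[b]
        # Two accesses to the same bank need two ports. A SARAM has only one,
        # so the second access is assumed to cost one extra cycle.
        if bank_a == bank_b and BANK_PORTS[bank_a] < 2:
            stalls += count
    return stalls


# Illustrative profile: coeffs/samples read together 1000 times,
# samples read twice in one cycle 200 times.
accesses = [("coeffs", "samples", 1000), ("samples", "samples", 200)]
bad = {"coeffs": "saram0", "samples": "saram0"}
good = {"coeffs": "saram1", "samples": "daram0"}
print(count_conflict_stalls(bad, accesses))   # 1200
print(count_conflict_stalls(good, accesses))  # 0
```

In this toy model, moving the two sections to different banks removes the first kind of conflict, and placing the section that is accessed twice per cycle in a DARAM bank removes the second.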
Purely SPRAM-based on-chip organization is suitable only for embedded applications of low to medium complexity. SPRAM-based systems do not use the on-chip RAM efficiently, as they require all the data sections that are currently accessed to be placed exclusively

in the SPRAM. It is possible to accommodate different data sections in SPRAM at different points in execution time by moving data dynamically between off-chip memory and SPRAM, but this results in run-time overhead and an increase in code size. For medium to large applications, which have a large number of critical data variables, a large amount of on-chip RAM becomes necessary to meet the real-time performance constraints. Hence, for such applications, pure SPRAM architectures are not preferred.

1.3 Data Layout

To use the on-chip memory efficiently, the critical data variables of the application need to be identified and mapped to the on-chip RAM. The memory architecture may contain both on-chip cache and SPRAM. In such a case it is important to partition the data sections and assign them appropriately to on-chip cache and SPRAM such that the memory performance of the application is optimized. Further, among the data sections assigned to on-chip cache and SPRAM, a proper placement of the data sections on the cache and SPRAM is required to ensure that cache misses are reduced and that the multiple memory banks of the SPRAM and the dual-ported SPRAMs are efficiently utilized. Identifying such a placement of data sections, referred to as the data layout problem, is a complex and critical step [10, 53]. This task is typically performed manually, as the compiler cannot assume that the code under compilation represents the entire system [10].

The application program in a modern embedded system is complex, since it must support a variety of device interfaces such as networking interfaces, credit card readers, USB interfaces, parallel ports, and so on. The application also has many multimedia components such as MP3, AAC and MIDI [8]. This necessitates an IP reuse methodology [74], where software modules developed and optimized independently by different vendors are integrated. Figure 1.2 explains the typical flow in embedded application development.
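The first step of the data layout problem described above, deciding which data sections go on chip, can be sketched with a simple greedy rule. This is only an illustration under stated assumptions and is not the heuristic developed later in the thesis: the function `partition_sections`, the section names, and the "accesses per byte" ranking are all hypothetical choices.

```python
from typing import Dict, List, Tuple


def partition_sections(
    sections: Dict[str, Tuple[int, int]],
    spram_bytes: int,
) -> Tuple[List[str], List[str]]:
    """Greedy split of data sections between on-chip SPRAM and off-chip memory.

    sections maps a section name to (size_in_bytes, access_count);
    sizes are assumed to be positive.
    """
    # Rank by accesses per byte, so small, frequently used sections go first.
    ranked = sorted(
        sections.items(),
        key=lambda item: item[1][1] / item[1][0],
        reverse=True,
    )
    on_chip: List[str] = []
    off_chip: List[str] = []
    free = spram_bytes
    for name, (size, _accesses) in ranked:
        if size <= free:
            on_chip.append(name)
            free -= size
        else:
            off_chip.append(name)
    return on_chip, off_chip


# Illustrative profile data (made up for this example).
profile = {
    "fir_coeffs": (512, 90_000),
    "frame_buf": (8192, 40_000),
    "lookup_tbl": (2048, 60_000),
    "log_buf": (4096, 500),
}
print(partition_sections(profile, spram_bytes=4096))
# (['fir_coeffs', 'lookup_tbl'], ['frame_buf', 'log_buf'])
```

A greedy rule like this ignores bank conflicts and cache behaviour, which is part of why the text calls the full data layout problem complex.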
Integrating these modules is a very challenging job with multiple objectives: (a) it has to be done under tight time-to-market constraints, (b) it has to be repeated for different variants of SoCs with different custom memory architectures, and (c) it has to be performed in such a way that the embedded application is optimized for performance,

power consumption and cost.

Figure 1.2: Embedded Application Development Flow

Since the IPs/modules are independently optimized, the integrator is under pressure to deliver the complete product with the expectation that each component performs at the same level as it did in isolation. This is a major challenge. When a module is optimized independently, the developer has all the resources of the SoC (MIPS and memory) available to optimize the module. When these modules are integrated at the system level, the system resources are shared among the modules. So the application integrator needs to know the MIPS and memory requirements of the modules unambiguously to be able to allocate the shared resources to critical needs [74]. Usually, the modules' memory requirements are given only at a high level. To be able to optimize the whole application/system, the integrator needs detailed memory analysis at the module level, e.g., which data buffers need to be placed in dual-ported memories and which data buffers should not be placed in the same memory bank. This data is usually not available. Moreover, the critical code is usually written in low-level assembly language to meet real-time constraints and/or

due to legacy reasons. For these reasons, application integration and optimization, which involves analyzing the application and mapping software modules to obtain optimal cost and performance, takes a significant amount of time (approximately 1-2 man-months). Currently, in most SoC designs, data layout is also performed manually, which has two major problems: (1) the development time is significant and not acceptable for current time-to-market requirements, and (2) the quality of the solution varies with the expertise of the designer.

1.4 Memory Architecture Exploration

In modern embedded systems, the area and power consumed by the memory subsystem is up to 10 times that of the data path, making memory a critical component of the design [11]. Further, the memory subsystem constitutes a large part (typically up to 70%) of the silicon area of current SoCs, and this is expected to go up to 94% in 2014, as shown in Figure 1.3 [6]. The main reason for this is that embedded memory has a relatively small per-area design cost in terms of man-power, time-to-market and power consumption [60]. Hence the memory plays an important role in the design of embedded SoCs. Further, the memory architecture strongly influences the cost, performance and power dissipation of an embedded SoC. As discussed earlier, the on-chip memory organization of embedded processors varies widely from one SoC to another, depending on the application and market segment for which the SoC is deployed. There is a wide variety of choices available to embedded designers, ranging from a simple on-chip SPRAM-based architecture to a more complex cache-SPRAM based hybrid architecture.

To begin with, the system designer needs to decide whether the SoC requires a cache and what the right size of on-chip RAM is. Once the high-level memory organization is decided, the finer parameters need to be defined to complete the memory architecture definition. For the on-chip SPRAM-based architecture, the parameters, namely size, latency, number of memory banks, number of read/write ports per memory bank, and connectivity, collectively define the memory organization and strongly influence the performance, cost, and power consumption. For cache based on-chip RAM,

the finer parameters are the size of the cache, associativity, line size, miss latency and write policy.

Figure 1.3: Memory Trends in SoC

Due to its large impact on system performance parameters, the memory architecture is often hand-crafted by the designer based on the targeted applications. However, with the combination of on-chip SPRAM and cache, the memory design space is too large for a manual analysis [31]. Also, with the projected growth in the complexity of embedded systems and the vast design space in memory architecture, hand optimization of the memory architecture will soon become impossible. This warrants an automated framework which can explore the memory architecture design space and identify interesting design points that are optimal from a performance, power consumption and VLSI area (and hence cost) perspective. As the memory architecture design space itself is vast, a brute-force design space exploration tool may take a large amount of computation time and hence is unlikely to be useful in meeting tight time-to-market constraints. Further, for each given memory architecture, there are several possible data section layouts which are optimal in terms of performance and power. This further compounds the memory architecture exploration problem.
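To give a feel for how quickly the design space grows, the following sketch enumerates combinations of the SPRAM and cache parameters named above. The candidate values are hypothetical and chosen only for illustration; the thesis does not specify them here, and `enumerate_points` is a made-up helper name.

```python
from itertools import product

# Hypothetical candidate values for the parameters named in the text.
SPRAM_PARAMS = {
    "size_kb": [16, 32, 64, 128],
    "banks": [1, 2, 4, 8],
    "ports_per_bank": [1, 2],
    "latency_cycles": [1, 2],
}
CACHE_PARAMS = {
    "size_kb": [4, 8, 16, 32],
    "associativity": [1, 2, 4],
    "line_bytes": [16, 32, 64],
    "write_policy": ["write-back", "write-through"],
}


def enumerate_points(params):
    """Yield every combination of parameter values as a dict."""
    names = list(params)
    for values in product(*(params[n] for n in names)):
        yield dict(zip(names, values))


spram_points = sum(1 for _ in enumerate_points(SPRAM_PARAMS))
cache_points = sum(1 for _ in enumerate_points(CACHE_PARAMS))
# A hybrid architecture pairs one SPRAM configuration with one cache configuration.
print(spram_points, cache_points, spram_points * cache_points)  # 64 72 4608
```

Even with only a handful of values per parameter, the hybrid space already has thousands of points, and each point would still need its own data layout search, which is why brute-force enumeration does not scale.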

1.5 Embedded System Design Flow

In this section, we present our view of the embedded system design flow to set the context for our work. For this purpose, we introduce the notion of the X-chart, which is inspired by the well-known Y-chart introduced by Gajski to capture the process of VLSI system design [29]. In a Y-chart, the three levels of design abstraction form the three dimensions of the figure Y; these are (a) design behavior, (b) design structure and (c) physical aspects of the design. A design flow starts from a behavior specification, which is then mapped to a structure, which in turn is mapped to a physical realization. We can view the process of transforming a behavior to a physical realization as a successive refinement process. Optimization of design metrics such as area, performance, and power is the goal of each of these refinement steps. The design process may spiral from the behavioral axis to the structural axis to the physical design axis in multiple stepwise refinement steps.

The X-chart is illustrated in Figure 1.4. The X-chart representation has four axes: (a) Behavior, (b) Logical Architecture, (c) Physical Architecture and (d) Software Data Layout. The logical memory architecture (LMA) defines the embedded cache size, cache associativity, cache block size, size of the scratch pad memory, number of memory banks, and the number of ports. The physical memory architecture (PMA) is an actual realization of an LMA using the memory library components provided by the semiconductor vendor. The fourth dimension, namely Software Data Layout, is necessary for capturing the process of embedded system design. We have identified several steps in the embedded system design flow and marked them with circled numbers. Table 1.1 explains the individual steps in the X-chart representation.

The design of an embedded system begins with a behavioral description (Point (1) in Figure 1.4, which is shown on the behavioral axis). Today, there are many languages available to capture the system behavior, e.g., SystemVerilog [5], SystemC [4], and so on. Hardware-software partitioning is performed to identify which functionalities of the description are best performed in hardware and which are best implemented in software. Hardware implementation is cost-intensive, but improves the performance.

31 1.5 Embedded System Design Flow 13 We show point (2) on the LMA axis, since hardware-software partitioning adds considerable amount of detail to decide the LMA parameters. The next step is to select hardware and software IP blocks. Depending on the time schedule (for designing the embedded system) and the cost constraint, the designer may wish to use readily available IP blocks from a vendor or implement a custom version of the IP. The target platform is then defined to implement the embedded system. As mentioned earlier, a platform includes one or more processors, memory, and hardware accelerators for specific functions. Platforms also come with software tools such as compilers and simulators, so that the development cycle can be accelerated. In other words, one does not need to wait for the hardware implementation to complete before trying out the software. We show point (4) on the software data layout axis, since the selection of a platform defines many aspects of software implementation. Software partitioning is now performed to decide which software IP blocks are executed on which processor. This completes one spiral cycle in the design life cycle of the embedded system. To recapitulate, the following components are defined at the end of the first cycle (a) the platform on which the embedded system will be built, (b) the hardware and software IP blocks that are selected for the target application, (c) assignment of software IP blocks to target processors where the software will be executed. We show point (5) on the behavioral axis, since the next spiral cycle will begin from here. The next step is to define the logical memory architecture for the memory subsystem. 
Guided by considerations such as cost, performance, and power, the designer must decide the basic architectural parameters of the memory subsystem, such as whether or not to provide cache memory, how many memory banks to provide, whether or not dual-ported memories are necessary for guaranteeing performance, etc. The next step is to perform design space exploration in the logical space. Each logical memory architecture is also characterized by the values selected for parameters such as cache size, cache associativity, cache block size, etc. There is often a cost/performance tradeoff between two solutions in the architectural space. Hence the designer must consider different Pareto-optimal solutions that exhibit a cost/performance tradeoff. This results in point (6) in

Figure 1.4.

Figure 1.4: Application Specific SoC Design Flow Illustration with X-chart

A logical memory architecture must be translated into a physical implementation by selecting components from the semiconductor vendor's memory library. There are multiple realizations, i.e., physical memory architectures (PMAs), for the same LMA. This involves choosing the appropriate modules based on the process technology selected in step (7) and the corresponding semiconductor vendor memory library. These realizations represent tradeoffs in terms of power consumed and VLSI area. This leads to point (7) in Figure 1.4. The mapping of an LMA to a PMA is similar to the technology mapping step in logic synthesis [53].

Data Layout (DL) is the subsequent step in the design life cycle. During this step, the placement of data variables is determined, considering every possible implementation

of the physical memory architecture. Once again, there are multiple data layout solutions for a given PMA. These solutions may exhibit tradeoffs in power, performance, and area. In this thesis, we use the phrase Physical Memory Architecture Exploration (PMAE) to refer to the search for Pareto-optimal LMA/PMA/DL solutions. We capture this in the form of the equation that follows.

\[
\text{PMAE} = \text{Logical Memory Architecture Exploration} + \text{Memory Allocation Exploration} + \text{Data Layout Exploration} \tag{1.1}
\]

Table 1.1: Explanation of X-chart Steps

In this thesis, the focus is on memory subsystem optimization, constituted by steps (5) to (9) in Figure 1.4. The size of the solution space increases manifold during each step of the memory exploration. If $N_1$ optimal solutions (logical memory architectures) are identified during memory subsystem definition, memory allocation must be explored for each one of them, which can potentially result in $N_1 N_2$ solutions during memory allocation exploration. Similarly, data layout must be performed for each of the $N_1 N_2$ solutions from the memory allocation exploration step, and we may in general obtain $N_1 N_2 N_3$ Pareto-optimal points in the PMAE solution space. As mentioned earlier, this leads to a combinatorially exploding design space.

1.6 Contributions

First, we propose methods for data layout optimization, assuming a fixed memory architecture for a DSP-based embedded system. Data layout is a critical component in the embedded design cycle and decides the final configuration of the embedded system. Data layout happens at the final stage in the life cycle of an embedded system, as illustrated in the X-chart of Figure 1.4, and it forms the foundation for memory subsystem optimization. Hence, we first formulate data section layout as an Integer Linear Programming (ILP) problem. The proposed ILP formulation can handle: (i) partitioning of data between on-chip and off-chip memory, (ii) placing simultaneously accessed data variables (parallel conflicts) in different on-chip memory banks, (iii) placing data variables that are accessed concurrently (self conflicts) in dual-access RAMs, (iv) overlay of data sections with non-overlapping lifetimes, and (v) swapping of data sections from/to off-chip memory. An important contribution of this work is the development of a simple unified ILP formulation that handles all of the above-mentioned optimizations.
The ILP-based approach is very effective for many moderately complex applications and delivers optimal results. However, as the application complexity increases, the execution time of the ILP method increases drastically, making it unsuitable for large applications and for situations (such as memory architecture exploration) where the data layout problem needs to be solved repeatedly.
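To make the conflict model concrete, the following is a minimal sketch and not the thesis's ILP formulation. It enumerates every assignment of a few data sections to memory banks and keeps the one with the fewest stall cycles. The section names, sizes, access and conflict counts, bank parameters, and the one-stall-per-conflict cost model are all hypothetical. Exhaustive enumeration stands in for an ILP solver here, which also hints at why the run time grows quickly with the number of data sections.

```python
from itertools import product

# All values below are hypothetical and chosen only for illustration.
SIZES = {"a": 4, "b": 4, "c": 2}             # section -> size in words
ACCESSES = {"a": 300, "b": 200, "c": 120}    # section -> access count
PARALLEL = {("a", "b"): 100, ("b", "c"): 10} # pairs accessed simultaneously
SELF = {"c": 50}                             # sections accessed twice at once
BANKS = [(8, False), (4, True)]              # (capacity, is_dual_access)
OFFCHIP = len(BANKS)                         # pseudo-bank index for off-chip
OFFCHIP_PENALTY = 1                          # assumed stalls per off-chip access


def stalls(assign):
    """Return the stall cycles of a layout, or None if a bank overflows."""
    used = [0] * len(BANKS)
    for sec, bank in assign.items():
        if bank != OFFCHIP:
            used[bank] += SIZES[sec]
    if any(u > cap for u, (cap, _) in zip(used, BANKS)):
        return None
    cost = 0
    # Every access to an off-chip section pays the off-chip penalty.
    for sec, bank in assign.items():
        if bank == OFFCHIP:
            cost += ACCESSES[sec] * OFFCHIP_PENALTY
    # Parallel conflict: both sections share one single-access on-chip bank.
    for (x, y), n in PARALLEL.items():
        bx, by = assign[x], assign[y]
        if bx == by and bx != OFFCHIP and not BANKS[bx][1]:
            cost += n
    # Self conflict: the section sits in a single-access on-chip bank.
    for sec, n in SELF.items():
        bank = assign[sec]
        if bank != OFFCHIP and not BANKS[bank][1]:
            cost += n
    return cost


def best_layout():
    """Try every section-to-bank assignment and keep the cheapest feasible one."""
    names = sorted(SIZES)
    best = None
    for choice in product(range(len(BANKS) + 1), repeat=len(names)):
        assign = dict(zip(names, choice))
        cost = stalls(assign)
        if cost is not None and (best is None or cost < best[0]):
            best = (cost, assign)
    return best


if __name__ == "__main__":
    print(best_layout())
```

With three sections and three placement choices each, only 27 layouts are checked; the count grows exponentially with the number of sections, which is the scaling concern raised above.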

Hence we looked at developing faster methods to solve this problem. We propose a heuristic algorithm that maps the data sections to the given memory architecture and reduces the number of memory access conflicts resulting from both self conflicts and parallel conflicts. Finally, we also formulate the same problem as a Genetic Algorithm (GA) and compare the results of the heuristic with those of the GA. We find that the heuristic algorithm performs within 5% of the GA's results, with the GA performing better. However, the heuristic algorithm's run-time is an order of magnitude lower than the GA's run-time, making it suitable for use in memory architecture exploration.

Next, we address logical memory architecture exploration for DSP-based embedded systems (steps (5) to (7) in the X-chart of Figure 1.4). The input is a set of high-level memory parameters, such as the number of memory banks, the size of each memory bank, the number of ports, etc., that define the memory subsystem. The goal of the exploration is to find an optimal on-chip memory organization that can run the given applications with the minimum number of memory stalls. When an LMA is generated, it must be evaluated for cost (in terms of VLSI area) and performance. But these depend on the data layout. Hence, to evaluate a memory architecture properly, we must first generate an efficient data layout. We use the fast heuristic method proposed by us. We have implemented the memory architecture exploration problem as a two-level hierarchical search, with architectural exploration at the outer level and data layout exploration at the inner level. A multi-objective GA and a Simulated Annealing (SA) algorithm are used as alternate search mechanisms for the architectural exploration problem. As the memory architecture exploration framework considers both performance and cost (VLSI area) objectives, we use the Pareto-optimality constraint proposed in [25] to identify design points that are interesting from one or the other objective.
The proposed memory exploration framework is fully automatic and flexible. The framework is also scalable, and additional objectives like power consumption can be added easily. We have used four different applications from multimedia and communication domains for our experiments and found Pareto-optimal design choices (memory architectures) for each of the applications.
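The two-level structure described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis's implementation: the outer level enumerates a tiny architecture space exhaustively (the thesis uses a multi-objective GA or SA instead), the inner level is a simple hypothetical greedy placement, and the section data, conflict counts, and area model are invented for the example.

```python
from itertools import product

# Hypothetical application profile: section -> (size in words, access count).
SECTIONS = {"x": (4, 100), "y": (4, 80), "z": (4, 60)}
# Hypothetical parallel-conflict counts between pairs of sections.
CONFLICTS = {("x", "y"): 50, ("y", "z"): 20}
BANK_OVERHEAD = 2  # assumed fixed area cost per bank, in word-equivalents


def layout_stalls(num_banks, bank_words):
    """Inner level: greedy data layout; returns the resulting stall count."""
    free = [bank_words] * num_banks
    place = {}  # section -> bank index, or -1 for off-chip
    weight = {s: 0 for s in SECTIONS}
    for (a, b), n in CONFLICTS.items():
        weight[a] += n
        weight[b] += n
    total = 0
    # Place the most conflict-prone sections first.
    for sec in sorted(SECTIONS, key=lambda s: (-weight[s], s)):
        size, accesses = SECTIONS[sec]
        # Off-chip is the fallback: assume one stall per access.
        best_bank, best_cost = -1, accesses
        for bank in range(num_banks):
            if free[bank] < size:
                continue
            # Stalls from conflicts with sections already in this bank.
            cost = sum(n for (a, b), n in CONFLICTS.items()
                       if (a == sec and place.get(b) == bank)
                       or (b == sec and place.get(a) == bank))
            if cost < best_cost:
                best_bank, best_cost = bank, cost
        place[sec] = best_bank
        if best_bank >= 0:
            free[best_bank] -= size
        total += best_cost
    return total


def dominates(p, q):
    """True if p is no worse than q in every objective and differs from q."""
    return all(a <= b for a, b in zip(p, q)) and p != q


def explore(bank_counts=(1, 2, 3), bank_sizes=(4, 8)):
    """Outer level: evaluate each architecture, keep the Pareto-optimal ones."""
    points = {}
    for nb, words in product(bank_counts, bank_sizes):
        area = nb * (words + BANK_OVERHEAD)
        points[(nb, words)] = (area, layout_stalls(nb, words))
    return sorted(arch for arch, obj in points.items()
                  if not any(dominates(o, obj) for o in points.values()))


if __name__ == "__main__":
    print(explore())
```

Each architecture is scored only after a layout has been generated for it, which mirrors the point made above that area and performance cannot be judged independently of the data layout.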

Next, we explore the data layout design space for a given physical memory architecture in order to optimize the performance and power consumption of the memory subsystem. Note that data layout exploration forms steps (8) to (9) in the X-chart representation. We propose MODLEX, a Multi Objective Data Layout EXploration framework based on a Genetic Algorithm, which explores the data layout design space for a given logical and physical memory architecture and obtains a list of Pareto-optimal data layout solutions from performance and power perspectives. Most of the existing work in the literature assumes that performance and power are non-conflicting objectives with respect to data layout. However, we show that a significant trade-off (up to 70%) is possible between power and performance.

Our next step is physical memory architecture exploration (steps (5) to (8) in Figure 1.4). We propose two different methods for physical memory exploration. The first approach is an extension of the Logical Memory Architecture Exploration (LMAE) method described in Chapter 4 and represented in the X-chart by steps (5) to (6). Physical memory exploration is performed by taking the output of LMAE and, for each of the Pareto-optimal logical memory architectures, performing a memory allocation exploration (steps (6) to (7)) with the objective of optimizing power and area in the physical memory space. Note that the data layout is fixed at the logical memory exploration stage itself, and hence the performance does not change at this step. The memory allocation exploration is formulated as a multi-objective genetic search that explores the design space with power and area as objectives. We refer to this approach as LME2PME.

The second approach is a direct and integrated approach for physical memory exploration, which we refer to as DirPME. This approach corresponds to a direct move from point (5) to point (8) in Figure 1.4.
In this approach, we integrate three critical components: (i) Logical Memory Architecture Exploration, (ii) Memory Allocation Exploration, and (iii) Data Layout Exploration. The core engine of the memory architecture exploration framework is formulated as a multi-objective Non-Dominated Sorting Genetic Algorithm (NSGA) [25]. For the data layout problem, which needs to be solved for thousands of memory architectures, we use our fast and efficient heuristic data layout method.
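As a rough illustration of the ranking idea behind non-dominated sorting, the sketch below peels a set of candidate designs into successive Pareto fronts over three minimized objectives. It covers only the ranking step, not a full NSGA (no selection, crossover, mutation, or fitness sharing), and the objective tuples (area, power, cycles) are made-up values rather than results from the thesis.

```python
def dominates(p, q):
    """True if p is no worse than q everywhere and strictly better somewhere.

    All objectives are assumed to be minimized.
    """
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def non_dominated_fronts(points):
    """Group point indices into fronts: front 0 is the Pareto-optimal set,
    front 1 is Pareto-optimal once front 0 is removed, and so on."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts


if __name__ == "__main__":
    # Hypothetical design points: (area, power, cycles).
    designs = [(10, 5, 100), (12, 4, 90), (11, 6, 110),
               (15, 7, 120), (9, 8, 130)]
    print(non_dominated_fronts(designs))
```

In a genetic search, designs in earlier fronts would typically receive better fitness, so the population drifts toward the Pareto-optimal set without collapsing the three objectives into a single weighted score.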

Our integrated memory architecture exploration framework searches the design space by exploring thousands of memory architectures and lists Pareto-optimal design solutions that are interesting from an area, power, and performance viewpoint.

Next, we address the memory architecture exploration problem for hybrid memory architectures that have a combination of SPRAM and cache. For such a hybrid architecture, a critical step is to partition the data between on-chip SPRAM and external RAM. Data partitioning aims at improving the overall memory subsystem performance by placing in SPRAM the data that have the following characteristics: (a) a higher access frequency, (b) a lifetime that overlaps with that of many other data, and (c) poor spatial access characteristics. Placing all data that exhibit these characteristics in SPRAM reduces the number of potentially conflicting data in the cache and hence the number of cache misses, leading to an overall improvement in memory subsystem performance. But typically the SPRAM size is small, and it is not possible to accommodate all the data identified for SPRAM placement. Hence, even after data partitioning, there will be a significant number of potentially conflicting data sections that need to be placed in external RAM. These data need to be mapped to the cache such that the conflict misses among them are reduced. Cache-conscious data layout addresses this problem and aims at placing data in external RAM (off-chip RAM) with the objective of reducing cache misses. This is achieved by an efficient data layout heuristic that is independent of instruction caches, optimizes run-time, and keeps the off-chip memory address space usage in check. We extend the above approach and perform hybrid memory architecture exploration with the objective of optimizing run-time performance, power consumption, and area.

The salient features of our work are as follows.
First, we provide a unified framework for logical memory exploration, memory allocation exploration, and data layout. Our work addresses power, performance, and area optimization in an integrated framework.

Our work addresses memory architecture exploration for a hybrid memory architecture involving on-chip SPRAM and cache. Our work does not rely on source-code optimization for power and performance optimization. Hence it is suitable for platform-based/IP-based system design.

1.7 Thesis Overview

The rest of the thesis is organized as follows. In the following chapter, we provide the background material for the thesis. We begin by explaining the memory architecture of a DSP and an MCU. We summarize the software optimizations used in the literature to improve memory access efficiency. We explain cache-based embedded SoCs and their challenges with respect to predictability. Finally, we introduce the concepts of a Genetic Algorithm (GA) for optimization, since GA is used in our optimization framework in the later chapters.

In Chapter 3, we propose different methods to address the data layout problem for on-chip SPRAM-based memory architectures. First, we propose an Integer Linear Programming (ILP) based approach. Further, we also propose a fast and efficient heuristic for the data layout problem. Finally, we formulate the data layout problem as a Genetic Algorithm (GA).

In Chapter 4, we present a multi-objective memory architecture exploration framework to search the memory design space for the on-chip memory architecture with performance and memory cost as the two objectives. We address the memory architecture exploration problem at the logical level.

The multi-objective data layout exploration problem is addressed in Chapter 5. Here, the data layout design space is explored for a given logical memory architecture and application with respect to performance and power.

In Chapter 6, we address the memory architecture exploration problem at the physical memory level. In this chapter we propose two different approaches for addressing physical memory architecture exploration.