MULTICORE PROCESSORS AND SYSTEMS: A SURVEY
by DM Rasanjalee Himali and Ruku Roychowdhury
A Survey Submitted in Partial Fulfillment of the Requirements of Advanced Computer Architecture, CSC 8210, Fall 2010
Abstract
A multicore architecture can be described as an integrated circuit with two or more individual processors, called cores. The implementations of multicore processors are numerous and diverse. A significant performance advantage, as well as improved power consumption, has been observed in multicores in recent years. To realize the advantage of multicores, software should contain components that can be parallelized to run on multiple cores simultaneously. The parallelization of software is a significant ongoing topic of research. With the increasing capability of multicores to execute multiple threads in parallel to achieve high speed-ups, programmers need to design code for execution by thousands of processes or threads. One of the major considerations here is how to write programs that can scale to hundreds of thousands of threads. A vast amount of research has been conducted in this area and, as a consequence, very popular commercially available programming languages and platforms have been developed in the recent past. Resource management is another hot topic in the research community today. Most of the academic work in this area has focused on modeling contention for last-level caches (LLCs), as this was believed to have the greatest effect on performance. Measuring multicore performance requires new ways of benchmarking. Once such new platforms are devised, there need to be new methods for interpreting the results. Many traditional benchmarks developed for multiprocessors are equally applicable to multicores, and there are also numerous new benchmarks proposed by researchers targeted at the multicore community. In this survey, we cover research in the above-mentioned areas along with current and future commercial multicore designs.
The main focus of this survey is to present and discuss research conducted in the areas of multicore architectures, resource management of multicores, parallelization and programming of multicores, measuring performance of multicore platforms, and future challenges in multicores. We also discuss commercially available tools and platforms for multicore programming and testing.
TOPIC INDEX Page
1. Introduction .. 4
1.1 Multicore Architectures .. 4
1.2 The Need for Multicores .. 5
1.3 Multicore Architecture Classification .. 6
1.3.1 Classification based on Application Class .. 7
1.3.2 Classification based on Memory System .. 7
1.4 Popular Multicore Architectures .. 8
1.5 On-Chip Interconnections .. 8
1.6 Future of Multicores .. 9
2. The Conceptual Architecture of Multicores .. 11
2.1 Motivation .. 11
2.2 Model of Computation .. 11
2.2.1 Process Based Models .. 12
2.2.1.1 Process Network .. 12
2.2.1.2 Synchronous Data Flow .. 14
2.2.1.3 Process Calculi .. 15
2.2.2 State Based Models .. 16
2.2.2.1 Finite State Machine .. 16
2.2.2.2 Hierarchical and Concurrent Finite State Machine .. 16
2.2.2.3 Program State Machine .. 16
3. Resource Management of Multicore Systems .. 18
3.1 Shared Memory Contention in Multicores .. 18
3.2 Shared Memory Management Strategies .. 19
3.2.1 Cache Partitioning Strategies .. 19
3.2.2 Contention Aware Scheduling Strategies .. 29
3.3 Power Management in Multicores .. 33
4. Parallelization with Multicores .. 37
4.1 Background .. 37
4.2 Design Spectrum of Parallelization .. 37
4.3 Types of Parallelism .. 40
4.3.1 Task Parallelism .. 41
4.3.2 Data Parallelism .. 41
4.3.3 Pipelining .. 41
4.3.4 Structured Grid .. 42
4.4 Multicore Programming Platforms .. 42
5. Measuring Multicore Performance .. 47
5.1 Traditional Benchmarking Methods .. 47
5.2 Multicore Benchmark Criteria .. 47
5.3 SMP Based Multicore Benchmarks .. 48
6. Conclusion and Future Challenges of Multicores .. 59
6.1 Conclusion .. 59
6.2 Future Challenges of Multicores .. 61
6.2.1 Software Challenge .. 61
6.2.2 Programmer's Challenge .. 64
6.2.3 Hardware Challenge .. 65
7. References .. 70
1. INTRODUCTION
As personal computers have become more prevalent and more applications have been designed for them, end-users have seen the need for faster, more capable systems to keep up. Speedup has been achieved by increasing clock speeds and, more recently, by adding multiple processing cores to the same chip, called multicores. In this chapter we give a thorough introduction to multicores, their architectures, their application areas and the challenges associated with them.
1.1. Multicore Architectures
Multicore architectures, compared to traditional single core architectures, integrate multiple processors on a single die. Multicore processors are Multiple Instruction Multiple Data (MIMD) architectures in that different cores execute different threads on different parts of the memory. Multicores are generally shared memory architectures. The L1 caches are usually private to cores; L2 caches are private in some architectures and shared in others.
Figure 1.1: Multicore Architecture
Figure 1.1 depicts a general multicore architecture. The close proximity of multiple CPU cores on the same processor chip allows the cache coherence circuitry to operate at a higher clock rate. Cores on the die run in parallel. Within each core, threads are time-sliced as on a uniprocessor; Intel calls these hyper-threads. The operating system presents each of these cores as a separate processor and maps threads to different cores. Most major operating systems, such as Windows and Linux, support multicores today. The composition and balance of the cores in multicore architectures show great variety. Some architectures use one homogeneous core design for all cores, while others use a mixture of different cores, each optimized for a different, heterogeneous role.
1.2. The Need for Multicores
Traditional single core architectures can no longer significantly increase processor performance by frequency scaling. For general-purpose processors, much of the motivation for multicore processors comes from the greatly diminishing gains in single-core performance, for three main reasons:
(i) The memory wall. There is an increasing gap between processor speed and memory speed. To mask this memory latency, cache sizes need to be larger. However, this is not a scalable solution, because it helps only to the extent that memory bandwidth is not the bottleneck in performance.
(ii) The ILP wall. It is becoming harder and harder to find enough parallelism in a single instruction stream to keep a high-performance single core processor busy.
(iii) The power wall. Increasing the operating frequency causes a disproportionate increase in power consumption, since dynamic power grows with frequency and with the square of the supply voltage, which must itself rise to support higher frequencies. The power wall poses manufacturing, system design and deployment problems that are no longer justified in the face of the diminished performance gains caused by the memory wall and the ILP wall.
In addition, deeply pipelined circuits in single core architectures suffer from heat problems, signal-propagation (speed-of-light) limits, and difficulties in design and verification. Also, many new applications are multithreaded, and the general trend in computer architecture has been a shift towards more parallelism. Computer architects therefore needed a new approach to improve performance. An excellent solution was performance scaling through parallel processing using multicores. Multicore is a relatively new concept. Adding additional cores to the same die would, in theory, double performance and dissipate less heat, though in practice the actual speed of each core is slower than that of the fastest single core processor.
Multicores can mitigate the power wall by reducing power consumption through voltage scaling, and the memory wall by reducing DRAM accesses with larger caches, while allowing multiple threads on multiple cores to execute simultaneously.
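The voltage-scaling argument above can be illustrated with a small sketch. The numbers are purely illustrative (not from the survey), and voltage is assumed to scale linearly with frequency:

```python
# Dynamic CMOS power scales roughly as P ~ C * V^2 * f.
def dynamic_power(c, v, f):
    return c * v * v * f

# One core at full frequency and voltage (illustrative values).
one_fast = dynamic_power(c=1.0, v=1.2, f=3.0e9)

# Two cores at 60% frequency, with voltage reduced proportionally.
# Combined raw throughput: 2 * 1.8 GHz = 3.6 GHz-equivalents, i.e. 1.2x.
two_slow = 2 * dynamic_power(c=1.0, v=0.72, f=1.8e9)

print(round(two_slow / one_fast, 3))  # 0.432: more throughput for under half the power
```

The cubic dependence of power on frequency (via the voltage term) is exactly what makes two slower cores cheaper, in watts, than one fast core of equal aggregate throughput.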
Multicore architectures come with many advantages. These include lower power consumption, lower heat dissipation and smaller device sizes. Also, the proximity of multiple CPU cores on the same die allows the cache coherency circuitry to operate at a much higher clock rate than is possible if the signals have to travel off-chip. When signals travel on-chip instead of off-chip, signal quality remains high because the signals are shorter and degrade less. This performance gain is most noticeable while running CPU-intensive processes, in terms of faster response times. For example, if an automatic virus scan runs while a movie is being watched, the application playing the movie is far less likely to be starved of processor power, as the antivirus program will be assigned to a different processor core than the one running the movie playback. Multicore architectures have certain disadvantages too. The migration to multicore devices requires complex changes to system and application software to obtain optimal performance. To optimize a multicore processor for a given set of resources and applications, the operating system as well as the applications need to be adjusted accordingly. Also, the performance gain seen from multicore processors greatly depends on the use of multiple threads in the application; simply doubling the number of cores does not double performance. Many technologies these days provide multicore support: Valve Corporation's Source engine, Emergent Game Technologies' Gamebryo engine and Apple Inc.'s latest OS, Mac OS X Snow Leopard, are some of the popular ones.
1.3. Multicore Architecture Classification
Multicores can broadly be classified into homogeneous and heterogeneous multicore architectures. Homogeneous multicores are those whose cores share the same ISA and implement multiple identical cores, whereas multicores whose cores have different ISAs are called heterogeneous multicore architectures.
The current trend, in fact, is towards homogeneous multicores. Figure 1.3 shows basic designs in multiple-CPU systems. According to [5], multicore systems can also be classified based on five distinguishing attributes: the application class, power/performance, processing elements, memory system, and accelerators/integrated peripherals. In this survey we explore some of these categories:
1.3.1. Classification based on application class
Multicore architectures can be designed to reflect the target application domain. Applications for multicore architectures broadly fall into two categories: data-processing dominated and control dominated. Data-processing dominated applications include many familiar types of applications such as graphics rasterization, image processing, audio processing and wireless baseband processing. The common feature of these applications is that the computation involves a sequence of operations applied to a data stream with minimal data reuse. These streaming applications, which require high throughput and performance, are good candidates for parallel operation, and they favor designs that have as many processing elements as practical with respect to the desired power/performance ratio. On the other hand, examples of control dominated applications include file compression/decompression, network processing and transaction query processing. These applications contain a considerable number of conditional branches in their code and a high degree of data reuse. Such applications favor a more modest number of general-purpose processing elements to handle the unstructured nature of control dominated code.
1.3.2. Classification based on memory system
Based on the memory system used, multicores can be classified into three main categories: distributed memory architectures, shared memory architectures and hybrid memory architectures. Figure 1.2 shows a multicore architecture classification based on memory designs. In a distributed memory architecture, each core typically has its own private memory, and communication between cores usually happens over a high speed network. The most common architecture, however, is the shared memory architecture, in which memory is shared by all the cores. In a hybrid architecture, both a shared memory and private per-core memories exist.
Figure 1.2: Memory Designs in Multicore Architectures
1.4. Popular multicore architectures
There has been massive growth in the multicore market, and a wide variety of multicores have been developed in recent years for commercial use. Table 1.1 [5] lists some general purpose architectures and their characteristics. The AMD PHENOM [6, 7], INTEL CORE i7 [8, 9], SUN NIAGARA [10, 11] and INTEL ATOM [12] processors are general purpose multicore architectures. All four are homogeneous architectures with large caches, intended for general purpose desktop and server applications where power is not an overriding concern. On the other hand, the ARM CORTEX [13] and XMOS XS1 [14] are intended for the general purpose mobile and embedded market. These are also homogeneous architectures, well suited to control dominated applications. Since they are developed for the embedded/mobile market, many run from batteries, making power an overriding concern. Table 1.2 shows some multicore architectures that are intended for high performance applications and therefore employ larger numbers of cores. For example, the AMD RADEON R700 [15] contains 160 cores while the NVIDIA G200 [16] contains 240 cores.
Table 1.1: General Purpose Multicore Architectures
Table 1.2: High Performance Multicore Architectures
1.5. On-Chip Interconnections
There have been several proposals and implementations of high-performance chip multiprocessor architectures [72, 73, 74, 75]. The proposed interconnect for Piranha [72] was an intra-chip switch. Cores in Hydra are connected to the L2 cache through a crossbar. In both cases, the L2 cache is fully shared. IBM
Power4 [75] has two cores sharing a triply-banked L2 cache. There have been recent proposals for packet based on-chip interconnection networks [76, 77]. Packet based networks structure the top level wires on a chip and facilitate modular design. Modularity results in enhanced control over electrical parameters and hence can result in higher performance or reduced power consumption. These interconnections can be highly effective in environments where most communication is local, explicit core-to-core communication; however, the cost of distant communication is high. Due to their scalability, these architectures are attractive for large numbers of cores.
1.6. Future of Multicores
(i) Improved memory system. There is an enormous need for increased memory in multicore systems with numerous cores on a single chip. Today, 32-bit processors such as the Pentium 4 can address up to 4 GB of main memory; with cores now using 64-bit addresses, the amount of addressable memory is vastly larger. The memory system needs to be significantly improved for multithreaded multiprocessors, to provide more main memory and larger caches.
(ii) System bus and interconnection networks. The interconnection between cores is very important and should be a major focus of chip manufacturers in order to improve the time required for memory requests. Improved interconnection networks and system buses result in faster networks and thus low latency in both inter-core communication and memory transactions. Some current approaches include Intel's QuickPath Interconnect [17], a 20-bit wide bus running between 4.8 and 6.4 GHz, and AMD's HyperTransport 3.0 [18], a 32-bit wide bus running at 5.2 GHz.
(iii) Parallel programming. One of the major challenges with multicores is parallel programming. Programmers should know how to write parallel programs, instead of sequential ones, that are capable of running on multiple cores in parallel.
However, developers of software for multicores should also be concerned with application requirements. For example, a programmer should be able to specify priorities for tasks assigned to different cores. Programmers should also be provided with sophisticated debugging tools for programs that run on multicores. In addition, software developers should provide methods to guarantee that the entire system
stops, and not just the core on which an application is running. These issues need to be addressed along with teaching good parallel programming practices to developers.
(iv) Starvation. Proper load distribution between cores is an important factor in multicores. If the program does not distribute load fairly between the cores, one or more cores may starve for data while others are overloaded. Also, with a shared cache, if a proper replacement policy is not in place, one core may starve for cache space and continually make costly calls out to main memory. The replacement policy should take into account cache entries that other cores have recently loaded. This becomes more difficult as an increased number of cores effectively reduces the amount of evictable cache space without increasing cache misses.
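The load-distribution point above can be sketched with a toy model. The task costs are made up, and `static_split` and `dynamic_pull` are hypothetical helpers (not from the survey); they contrast a fixed compile-time split against pulling work from a shared pool:

```python
def static_split(tasks, n):
    # Fixed contiguous slices, decided once before execution starts.
    k = len(tasks) // n
    return [sum(tasks[i * k:(i + 1) * k]) for i in range(n)]

def dynamic_pull(tasks, n):
    # Each task goes to whichever core becomes free first
    # (equivalently: cores pull from a shared work queue).
    load = [0] * n
    for t in tasks:
        load[load.index(min(load))] += t
    return load

tasks = [5, 5, 5, 5, 1, 1, 1, 1]   # uneven task costs, arbitrary units
print(static_split(tasks, 2))      # [20, 4] -> core 1 starves, core 0 is overloaded
print(dynamic_pull(tasks, 2))      # [12, 12] -> balanced
```

With a static split, one core finishes its light slice and sits idle; pulling from a shared pool keeps both cores busy until the work runs out.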
2. CONCEPTUAL ARCHITECTURE
2.1. Motivation
The multicore design process is getting more complex day by day due to high performance requirements and numerous design constraints. Heterogeneous multicore systems are built of several different processing units, e.g. processors, memories, buses and various other communication interfaces. In addition, many possible HW/SW partitioning schemes as well as parallelization techniques are involved in the design process. Therefore, the design of such systems is extremely complicated and time consuming. The system specification has to be clear and unambiguous, and it must provide sufficient expressive power to automate the design process. The concepts and techniques involved at the specification level affect the quality, accuracy and rapidity of the results. A good specification model is therefore essential for validating the design and keeping it analyzable.
2.2. Model of Computation
Models of Computation (MoC) [2] act as the building blocks for defining multicore system behavior. A specific multicore can impose specific requirements on the MoC. Heterogeneous multicore systems are composed of different processing elements and generally need an efficient concurrent model of computation, so facilities to express concurrent behaviour are very important. To make the most of the concurrency inherent in multicore systems, it is convenient for an application specification to be described in a parallel form. In this way, programming of a multicore platform can be done in a systematic and automated way. Moreover, factors such as determinism, predictability, and secure and dependable operation over time are just as important as the functionality itself. A MoC can handle both fine and coarse levels of granularity. At the fine level of granularity, basic entities correspond to individual instructions or statements.
In contrast, at a coarse level of granularity, basic entities correspond to entire blocks of code. At the coarse level of granularity, based on the
basic entities and corresponding composition rules employed, MoC can be categorized into two types: process-based and state-based MoC.
2.2.1. Process-oriented MoC
Process-oriented MoC [3] describe system behavior as a set of concurrent processes that communicate with each other either through message-passing channels or through shared memory. In process-oriented MoC, stress is put on describing the concurrency explicitly. Such a specification is appropriate for implementation on a multicore platform, which inherently uses concurrency. A deterministic MoC produces the same output whenever it is executed on the same input set. Such behavior is highly desirable for validating system behavior, but a fully deterministic model can sometimes result in over-specification. More specifically, the ideal global system should be comprised of several processes that produce the same outputs for a given set of inputs, while the order of their execution remains unconstrained. To cope with these constraints and requirements, different process based MoC have been proposed over the years. Some of the most prominent process based MoC include Process Networks, Dataflow Models and Process Calculi.
2.2.1.1. Process Network
The main characteristic of such process based MoC is that they can exhibit dynamic behavior globally while avoiding nondeterminism at the level of a single process's execution [2]. In Kahn Process Networks (KPN), communication happens over unidirectional, point-to-point message-passing channels. Such message-passing channels are implemented with buffers, which enable asynchronous communication. The communication channels are unbounded, so senders can never block. Receiver processes, on the other hand, always block until the required input data are available on the channel. A process can wait on only a single channel at a time and cannot test a channel for data without blocking on it.
Therefore, the sequence of channel accesses is predetermined, and processes cannot change their behavior based on the order in which data become available on particular channels at run time. This ensures deterministic system behavior that does not depend on the order in which processes are scheduled. KPN allow deadlocks: the network terminates globally when a deadlock occurs in the system.
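A minimal Kahn-style network can be sketched with Python threads and queues. This is an illustrative reading of the semantics above (unbounded point-to-point channels, non-blocking writes, blocking reads, fixed read order), not an implementation from the survey:

```python
import queue
import threading

c1, c2 = queue.Queue(), queue.Queue()   # unbounded point-to-point channels

def producer():
    for i in range(5):
        c1.put(i)          # send never blocks (unbounded buffer)

def doubler():
    for _ in range(5):
        x = c1.get()       # receive blocks until data is available
        c2.put(2 * x)      # read order is fixed, never data-dependent

threads = [threading.Thread(target=f) for f in (producer, doubler)]
for t in threads:
    t.start()
for t in threads:
    t.join()

result = [c2.get() for _ in range(5)]
print(result)  # [0, 2, 4, 6, 8] no matter how the threads are scheduled
```

Because each process reads its channels in a predetermined order and blocking reads are the only synchronization, the output sequence is the same under any thread schedule, which is exactly the determinacy property claimed for KPN.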
Fig. 2.1 KAHN Process Network
The scheduling strategy is another important aspect of Kahn Process Networks, and it directly affects completeness and memory requirements. The two basic policies are given below:
(i) Demand driven scheduling: In demand driven scheduling, a process runs only when its data is needed. In the scenario of Fig. 2.1, process P1 will only be executed when P3 requires data, and process P2 is executed whenever process P3 or process P4 requires data. Such behavior can lead to global artificial deadlocks when a single process deadlocks locally: if P3 is blocked in a local deadlock or does not need any data, demand driven scheduling will never execute process P2, which in turn prevents the independent process P4 from making progress.
(ii) Data driven scheduling: This strategy was developed to mitigate the limitations of the demand driven strategy. Data driven scheduling runs processes whenever they are ready. Here the local deadlock problem is avoided, but tokens can accumulate on arcs, creating a memory consumption problem. In general, KPN is determinate, i.e. regardless of the scheduling policy employed, for a given input set the output will always be the same. This characteristic makes it suitable for designing multicore systems, and it gives a lot of scheduling freedom that can be exploited when mapping process networks onto various multicore platforms [2]. The main drawback of the KPN is that it requires dynamic scheduling with runtime context switching as well as dynamic memory allocation.
2.2.1.2. Synchronous Data Flow
KPN requires dynamic scheduling with runtime context switching and dynamic memory allocation, which makes it hard to implement in practice. The Synchronous Data Flow (SDF) specification addresses these shortcomings. SDF is an extension of traditional data flow models. In data flow models, processes are broken down into atomic blocks of execution called actors. An actor executes only once it has received all its required input tokens, which avoids context switching within a running process. On each execution, an actor consumes a required number of tokens and generates a resulting number of tokens.
Fig. 2.2. Synchronous Data Flow
In SDF, the number of tokens consumed and generated by an actor at each firing is fixed. Hence, the amount of data flow and control flow in an SDF graph is predetermined and cannot change based on any runtime scenario. As a result, static SDF graphs are bounded in nature, and the required buffer sizes for the communication channels are known before runtime. Fig. 2.2 shows an example SDF graph with four actors a, b, c, d. On every execution, a produces two tokens, one of which is consumed by b; b produces two tokens, one of which is consumed by c; c produces one token and sends it to d; finally, d consumes two tokens on its input link and produces two tokens on its output link. The graph is initialized by putting two tokens on the arc between c and d. To schedule such a graph we first have to produce a set of linear equations (balance equations) that determine the execution rates of the actors relative to each other. For the given graph, the set of linear equations is:
2a = b (arc a→b: a produces two tokens per firing, b consumes one)
2b = c (arc b→c: b produces two tokens per firing, c consumes one)
2d = c (arc c→d: c produces one token per firing, d consumes two)
This implies that b fires twice for every firing of a, c fires twice for every firing of b, and c fires twice for every firing of d. The system of linear equations reduces to
4a = 2b = c = 2d
Picking the solution with the smallest rates, we execute c four times and b and d two times each for every execution of a. If the equations are inconsistent, or have no solution other than setting all rates to zero, the SDF graph cannot be statically scheduled or would otherwise lead to accumulation of tokens on the arcs. After calculating the execution rates we can generate a schedule by simulating one iteration until the initial state is reached again. If a deadlock occurs during the iteration, initial tokens can be placed on an arc to resolve it. For Fig. 2.2, an example execution order is a d b c c d b c c. For this schedule, the number of tokens accumulated at any time is at most two on each arc, for a total memory requirement of 8 token buffers. Though SDF provides a set of significant advantages over KPN, such as static scheduling and no expensive runtime context switching, it has its own limitations; for example, the SDF model cannot express conditional execution of a block [2].
2.2.1.3. Process Calculi
Process calculus gives a high level formal description of the interactions, communications and synchronization mechanisms among concurrent processes [3]. The formal description is presented as a set of processes, composition rules and axioms. Composition can be of two types: parallel composition and sequential composition. Furthermore, notions of recursion and replication are also supported. Because of these characteristics of formalization and restricted execution, process calculi models are suitable for analysis, equivalence checking and formal verification.
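The balance-equation procedure of Section 2.2.1.2 can be sketched in a few lines. The solver below is an illustrative reconstruction (arc rates follow the Fig. 2.2 description; the function name is ours, not from the survey): it propagates rational rates along the arcs and scales them to the smallest positive integer repetition vector.

```python
from fractions import Fraction
from math import lcm

# Arcs of the SDF graph in Fig. 2.2: (producer, tokens produced per firing,
# consumer, tokens consumed per firing). Only these three arcs constrain
# the relative rates.
arcs = [("a", 2, "b", 1), ("b", 2, "c", 1), ("c", 1, "d", 2)]

def repetition_vector(arcs):
    # Balance equation per arc: produced * rate[u] == consumed * rate[v].
    rates = {arcs[0][0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for u, p, v, c in arcs:
            if u in rates and v not in rates:
                rates[v] = rates[u] * p / c
                changed = True
            elif v in rates and u not in rates:
                rates[u] = rates[v] * c / p
                changed = True
    # Inconsistent equations -> the graph cannot be statically scheduled.
    for u, p, v, c in arcs:
        assert rates[u] * p == rates[v] * c, "inconsistent SDF graph"
    # Scale to the smallest positive integer solution.
    scale = lcm(*(r.denominator for r in rates.values()))
    return {actor: int(r * scale) for actor, r in rates.items()}

print(repetition_vector(arcs))  # {'a': 1, 'b': 2, 'c': 4, 'd': 2}
```

The result matches the rates derived above: per iteration, a fires once, b and d twice, and c four times, consistent with the schedule a d b c c d b c c.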
2.2.2. State based Models
State based models [3] are described in terms of state machines and consist of a set of states and transitions between those states. State based models put more emphasis on explicitly showing control flow; typically, the states explicitly represent the memory state of the program. The difference between state based models and process-oriented MoC is that state based models are mainly used for control dominated applications.
2.2.2.1. Finite state machine (FSM)
The FSM is a basic model in computer science for modeling various types of applications. It is defined as a quintuple <S, I, O, f, h>, where S represents the set of states, I and O the sets of inputs and outputs respectively, f: S x I -> S the next-state function and h the output function [3]. Traditional FSM are sequential, i.e. they are in exactly one state at a time. Therefore, every new configuration requires its own state, which can lead to a very large number of states when modeling a considerably large system. To solve this problem, extensions of the traditional FSM, such as the FSM with data (FSMD) and the hierarchical and concurrent FSM (HCFSM), have evolved.
2.2.2.2. Hierarchical and concurrent Finite State Machine (HCFSM)
Hierarchy and concurrency are further techniques for handling the complexity of a system. In hierarchical models, the concept of super states [3] is introduced. Each super state can be a standalone FSM. Entering a super state is equivalent to entering the start state of the FSM within it. Whenever a super state finishes its execution and exits, the parent FSM transitions to another super state. Concurrency, on the other hand, breaks complex state machines into multiple simpler FSMs running in parallel. These FSMs can communicate with each other through shared channels, shared variables and fired events.
2.2.2.3. Program State Machine
Program state machines (PSM) [2] can be seen as a combination of KPN and HCFSM.
Therefore, PSM makes use of the asynchronous execution of the KPN model as well as the notions of hierarchy and control from the HCFSM model.
Fig 2.3 Program State Machine
As shown in Fig. 2.3, processes run asynchronously to each other and can execute both concurrently and sequentially. Concurrent processes communicate through message-passing channels that incorporate FIFO buffers in order to provide asynchronous communication. Such message-passing channels help separate communication details from computation.
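The FSM quintuple <S, I, O, f, h> of Section 2.2.2.1 can be written out directly. The parity checker below is a hypothetical example (not one from the text) chosen only because it needs just two states:

```python
# A two-state FSM as the quintuple <S, I, O, f, h>: track the parity
# of the number of 1-bits seen so far (Moore-style output h: S -> O).
S = {"even", "odd"}
I = {0, 1}
O = {"even", "odd"}
f = {("even", 0): "even", ("even", 1): "odd",   # next-state function f: S x I -> S
     ("odd", 0): "odd",   ("odd", 1): "even"}
h = lambda s: s                                  # output function h: S -> O

def run(fsm_f, fsm_h, start, inputs):
    state = start
    for sym in inputs:
        state = fsm_f[(state, sym)]
    return fsm_h(state)

print(run(f, h, "even", [1, 0, 1, 1]))  # odd (three 1-bits seen)
```

The state-explosion problem mentioned above is visible even here: tracking, say, parity of two independent bit streams in one flat FSM needs the product of the state sets, which is exactly what HCFSM's concurrent sub-machines avoid.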
3. RESOURCE MANAGEMENT OF MULTICORE SYSTEMS
Improved hardware resource utilization is an important aspect of multicore architectures. Shared resources include off-chip bandwidth, shared memory, power, etc. Better resource management strategies improve performance and allow for smaller die areas and simpler battery technologies. Shared resources mostly have to do with the memory hierarchy. In the following sections we discuss the memory contention problem and memory management strategies.
3.1. Shared Memory Contention in Multicore Systems
Figure 3.1 illustrates a system with two memory domains and two cores per domain. In this system, threads on cores within the same memory domain can compete for shared resources. This results in significant performance degradation compared to the performance a thread could achieve in a contention-free environment. Previous studies have documented that the execution time of a thread can vary significantly depending on whether or not threads run on the other cores of the same chip [19, 20]. This is especially true where cores share last-level caches (LLCs). For example, in Figure 3.1, core 0 and core 1 compete for one shared LLC while core 2 and core 3 compete for the other.
Figure 3.1: A Multicore System with Two Memory Domains
When a thread issues a cache request for a line that is not already in the cache (i.e., a cache miss), a new cache line must be allocated to bring in the requested line. This becomes a problem if the cache is full when the request arrives, because some cache line must be evicted to free up space for the new line. It is quite possible that the evicted line belongs to a different thread from the one that issued the request, thus degrading that thread's performance. Modern CPUs do not assure any fairness in this regard.
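The interference just described is easy to reproduce with a toy LRU cache model. This is a sketch with made-up traces (not data from [19, 20]): a thread with a small, cache-friendly working set is turned into an all-miss victim merely by sharing the cache with a streaming thread:

```python
from collections import OrderedDict

class LRUCache:
    """A toy fully associative cache with LRU replacement."""
    def __init__(self, lines):
        self.lines, self.store = lines, OrderedDict()
    def access(self, addr):
        hit = addr in self.store
        if hit:
            self.store.move_to_end(addr)        # refresh recency on a hit
        else:
            if len(self.store) >= self.lines:
                self.store.popitem(last=False)  # evict the LRU line
            self.store[addr] = True
        return hit

def victim_miss_rate(cache, trace):
    """Miss rate of the accesses tagged 'v' in a (tag, addr) trace."""
    misses = total = 0
    for tag, addr in trace:
        hit = cache.access(addr)
        if tag == "v":
            total += 1
            misses += not hit
    return misses / total

# Victim reuses 8 lines; the streamer always touches fresh lines.
victim = [("v", a % 8) for a in range(1000)]
streamer = [("s", 1000 + a) for a in range(1000)]
interleaved = [x for pair in zip(victim, streamer) for x in pair]

alone = victim_miss_rate(LRUCache(12), victim)        # only 8 cold misses
shared = victim_miss_rate(LRUCache(12), interleaved)  # every access misses
print(alone, shared)  # 0.008 1.0
```

The streamer never reuses a line, so LRU gains nothing by caching its data, yet it still pushes the victim's 8-line working set out of the 12-line cache before each reuse. This is precisely the unfairness that the partitioning schemes of Section 3.2 target.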
Figure 3.2 [23] shows the cache sensitivity under the LRU insertion policy of two SPEC CPU2006 workloads. When these two workloads execute concurrently and share a 2 MB cache, soplex, a streaming application, interferes with h264ref. Cache performance can be improved by reducing this interference.
Figure 3.2: The Shared Cache Problem
3.2. Shared Memory Management Strategies
Many researchers have proposed strategies to address the problem of resource contention. These strategies fall into two main categories: cache partitioning strategies and contention aware scheduling.
3.2.1. Cache Partitioning Strategies
Cache partitioning is one of the popular ways to effectively utilize shared resources between cores with minimal contention. In cache partitioning, shared caches such as L2 and L3 caches are partitioned among the threads running simultaneously on multiple cores. Many multicore processors today still use cache designs inherited from uniprocessors. However, many cache partitioning methods have been proposed in the recent past, focusing on a variety of optimization objectives including performance, fairness and quality of service. The cache partitioning strategies proposed to date can be broadly classified into static partitioning schemes and dynamic partitioning schemes.
A. Static Cache Partitioning Strategies
In static cache partitioning, the cache is partitioned statically among multiple threads from different cores. Once defined, the sizes of the cache partitions do not change. Below we discuss some static cache partitioning schemes that are closely aligned with the topic of this survey.
(i) Optimal Cache Partitioning In optimal cache partitioning [21], the authors propose a new method for optimal allocation of cache memory among competing processes. They focus on two main problems. The first problem is the allocation of interlaced data and instruction streams to cache memory. The authors develop a model for a simpler, modified LRU replacement strategy and use it to obtain a model for pure LRU replacement. This modified LRU strategy provides better results in certain circumstances. In this work, the overall miss rate of a cache memory is used as the measure of optimality, and an optimal partition is defined as a partition of the cache among competing processes or threads that achieves the minimum miss rate. The second problem is how to allocate memory among processes in a multiprogramming environment. Here, there exists a cache reload transient time associated with the event of a new process taking over the processor. During the early part of this cache reload transient, the miss rate goes up; it eventually comes back down once the working set has been brought into the cache. Let us now look into how the first problem is solved: the allocation of cache memory between data and instruction streams. In this problem, the authors consider interlaced instruction and data streams that have different cache behaviors. They show that in this idealized setting, optimal allocation occurs at a point where the competing processes' miss-rate derivatives are equal. Employing fully associative search in lookup or replacement algorithms in faster memories such as caches is infeasible because of the high complexity of maintaining LRU information. Therefore, a simpler approach called approximate LRU replacement is taken, where a search is performed on a small set of items and the least recently used one in this set is replaced if necessary.
For slower memories in the memory hierarchy, however, fully associative search can be used and therefore true LRU replacement is employed. The main focus of this work is the miss rate as a function of the cache allocation of competing processes. The miss rate is assumed to be a function of cache allocation size. For fully associative caches, the miss rate for a given reference stream is a single-parameter function and depends only on the number of lines allocated to a process. This is because the entire cache is searched for a match during a cache lookup. For set-associative caches, the miss rate depends not only on how many lines are allocated to a process, but also
where they are located in the cache, since only a set of a few lines is actually searched during a lookup. In this work, the authors use a simplified model of set-associative caches. In this model, the miss rate is modeled as a single-parameter function where the parameter is the number of lines allocated. The physical distribution of these lines among the sets of the cache is considered irrelevant and assumed to have no effect on the cache miss rate. This assumption allows us to treat both set-associative caches and fully associative caches in the same manner and therefore to apply the proposed model to both types of caches. Let us examine the processes that generate the cache references. Assume that an address-reference stream is composed of two interlaced streams of addresses. One stream consists of instruction fetches, and the second stream consists of data fetches. The composite stream is an interleaving of the two streams so that its address references alternate between data and instructions. That is, the stream has the form I, D, I, D, ..., where I and D are instruction and data references, respectively. Each component stream has a known cache behavior given by a miss rate for that stream as a function of the cache memory allocated to the process. Let MI(x) be the miss rate for the I stream as a function of cache size x, and, similarly, let MD(x) be the miss rate for the data stream. We assume that both the instruction and data processes are stationary in time, so that the miss rates are not time-varying functions. Now we need to determine the optimal fixed allocation of cache for the I and D streams. For this, we find an expression for the misses in a period of time that has exactly T references, and find an allocation at which the derivative of the miss-rate function goes to zero. The total number of misses in a time period with T references is the composite miss rate times the length of the period.
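Under these assumptions, the optimal split of a cache of C lines between the I and D streams can be found numerically. Below is a minimal sketch; the miss-rate curves mi and md are hypothetical stand-ins for the measured MI(x) and MD(x) of a real workload, not curves from the paper.

```python
# Sketch of finding the optimal I/D cache split under the model above.
# The miss-rate curves are hypothetical, monotonically decreasing functions.

def mi(x):
    # instruction-stream miss rate given x cache lines (assumed shape)
    return 1.0 / (1.0 + 0.5 * x)

def md(x):
    # data-stream miss rate given x cache lines (assumed shape)
    return 1.0 / (1.0 + 0.1 * x)

def total_misses(x, c=64, t=1000):
    # I and D references alternate, so each stream sees T/2 references;
    # the data stream gets the remaining c - x lines.
    return (t / 2) * (mi(x) + md(c - x))

def optimal_split(c=64):
    # Exhaustive search over integer allocations (fine for a small cache).
    return min(range(1, c), key=total_misses)

x_opt = optimal_split()
```

At the minimum, the derivatives of the two miss-rate curves are (approximately) equal, matching the optimality condition the authors derive.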
Since we assume that I and D references occur with equal frequency in the interval T, the total number of misses for an allocation of x lines to the I stream out of a total cache size of C lines is given by:

Misses(T) = (T/2) [ MI(x) + MD(C - x) ]    (1)

To minimize the overall miss rate, the authors minimize the total misses given in (1) by setting the derivative of the right-hand side of (1) to 0, which occurs at a value of x that satisfies:

MI'(x) = MD'(C - x)    (2)
The authors also show that a conventional LRU replacement policy has a most probable state that is not the optimal allocation of memory between the I and D reference streams, but it is capable of producing very good allocations. (ii) Multi-Queue The authors of [22] propose a cache management scheme that organizes each cache set into multiple FIFO queues. In a FIFO, each entry corresponds to a single cache line, and the collection of all entries across the queues makes up the lines of a cache set. Figure 3.3 (a) depicts the organization for a single core. As shown in the figure, there are two levels of FIFO queues and an LRU-managed cache area. To get into the LRU-based cache region, a line must pass through the first-level FIFO and then the second-level FIFO. Each queue entry has an associated u-bit. This u-bit is reset when a line enters any of the queues and set on every reuse. For a queue of size Q, after Q more insertions the original line leaves the queue. If its u-bit is still zero, the cache immediately evicts the line. If the u-bit is one, the line is inserted into the next-level FIFO. This organization allows rarely used lines to be evicted earlier than they would be from a shared LRU region. The modified organization for a multicore environment is shown in Figure 3.3 (b). Here, each core is assigned a private first-level FIFO. The second-level FIFO and the LRU-managed cache region are shared among cores. Figure 3.3: Logical organization of the basic multi-queue cache management scheme for (a) a single core and (b) multiple cores.
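The promotion and eviction rules above can be sketched in a few lines. This is a toy model of the multi-queue idea, not the authors' implementation: queue sizes, the dictionary-based u-bit bookkeeping, and the class name are all assumptions made for illustration.

```python
from collections import OrderedDict, deque

class MQCache:
    """Toy sketch of the multi-queue scheme: a per-core first-level FIFO,
    a shared second-level FIFO, and a shared LRU region. A line whose
    u-bit is still 0 when it leaves a FIFO is evicted immediately."""

    def __init__(self, num_cores, q1_size=2, q2_size=4, lru_size=4):
        self.q1 = {c: deque() for c in range(num_cores)}  # per-core FIFO
        self.q2 = deque()                                 # shared FIFO
        self.lru = OrderedDict()                          # shared LRU region
        self.q1_size, self.q2_size, self.lru_size = q1_size, q2_size, lru_size
        self.ubit = {}                                    # u-bit per resident line

    def access(self, core, line):
        """Return True on a hit. A hit sets the u-bit; a miss inserts the
        line into the core's first-level FIFO with its u-bit cleared."""
        if line in self.ubit:
            self.ubit[line] = 1
            if line in self.lru:
                self.lru.move_to_end(line)                # refresh LRU position
            return True
        self.q1[core].append(line)
        self.ubit[line] = 0
        if len(self.q1[core]) > self.q1_size:
            self._dequeue(self.q1[core].popleft(), promote_to_q2=True)
        return False

    def _dequeue(self, line, promote_to_q2):
        # A line leaving a FIFO is evicted unless its u-bit proves reuse.
        if self.ubit[line] == 0:
            del self.ubit[line]                           # evict: never reused
            return
        self.ubit[line] = 0                               # promote, reset u-bit
        if promote_to_q2:
            self.q2.append(line)
            if len(self.q2) > self.q2_size:
                self._dequeue(self.q2.popleft(), promote_to_q2=False)
        else:                                             # promote into LRU region
            self.lru[line] = True
            if len(self.lru) > self.lru_size:
                old, _ = self.lru.popitem(last=False)
                del self.ubit[old]
```

A never-reused line is dropped as soon as it leaves a first-level FIFO, while a flood of insertions from one core cannot displace lines sitting in another core's private queue.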
The MQ approach targets three problematic cache behaviours. The first problem is cache lines that are inserted into the cache but never referenced. There is a considerable number of such cache lines in a cache at any given time. If the conventional LRU replacement policy is employed, in a w-way cache such a line gets evicted only after w insertions. By occupying cache space, such lines reduce the number of hits. In the MQ approach, however, such lines get evicted after only Q < w insertions for a Q-entry first-level FIFO, thereby reducing the residency time of no-reuse lines. The second problem is cache bursts resulting from temporal locality: after the initial burst, such cache lines are never reused until eviction, so under LRU such a line is evicted only after w more insertions. The third problem is the isolation of different cores. A core with a high access rate can quickly preempt cache lines of another core sharing the same cache. Figure 3.4 (left) shows a two-core example for a cache managed using the LRU policy. Here, core 0 inserts one line while core 1, which has a rapid access rate, inserts a large number of cache lines in succession. This results in the eviction of core 0's inserted line and other useful lines residing in the cache. Subsequent accesses to these evicted lines will result in cache misses. To solve this problem, MQ introduces a dedicated first-level queue for each core, which isolates the traffic of a given core. As shown in Figure 3.4 (right), when a line is inserted into core 0's first-level queue, it is protected from eviction by other cores. Also, since core 1's lines show no reuse, these lines are evicted as soon as they are dequeued from core 1's first-level queue. Other cache lines with proven reuse (those with the striped patterns in the figure) are maintained in the cache in the second-level queue or the LRU region. This allows additional cache hits to occur.
Figure 3.4: (left) Two cores where Core 0 reuses its lines and Core 1 does not, but Core 1 accesses the cache at a much faster rate. (right) The same scenario when Core 0 and Core 1 have their traffic isolated using multiple queues. B. Dynamic Cache Partitioning Strategies Dynamic cache partitioning is more successful than static partitioning because it allows partitioning the cache among cores at run time based on need. A recent study [27] showed that dynamically changing the insertion policy can provide high-performance cache management for private caches at negligible hardware and design overhead. Here we describe some of the popular schemes. (i) TADIP Thread-Aware Dynamic Insertion Policy (TADIP) [23] is an adaptive insertion policy for shared caches among competing threads running on multiple cores. TADIP aims at achieving four goals: high performance, robustness, scalability, and low design overhead. The number of cores and concurrent threads is expected to increase in future processor designs. Therefore, TADIP aims at providing an insertion mechanism that is scalable with the number of cores and threads while having minimal effect on performance in workloads where the LRU policy works better. TADIP also manages shared caches with negligible hardware overhead.
The Dynamic Insertion Policy (DIP) proposed in [27] primarily uses two policies: the Bimodal Insertion Policy (BIP) and the LRU policy. BIP inserts the majority of incoming cache lines in the least recently used position while selectively inserting a few in the most recently used position. The set dueling principle is used in DIP to dynamically select the best policy at any particular time. Set dueling monitors are used to keep track of the misses generated by each policy. For this, specific cache sets are dedicated to running each of the policies. The winning policy, i.e., the policy with the fewest misses, is then followed by the rest of the cache sets. However, DIP does not take into account the possibility of multiple threads competing for a shared cache. In TADIP, the authors propose an extension to DIP which addresses this shortcoming. The authors show that TADIP outperforms the traditional LRU policy, demonstrating that there is significant room for improvement over LRU. TADIP needs to make a binary decision between LRU and BIP for each competing application running on the cores. For N concurrently executing applications sharing a cache, the search space can be viewed as an N-bit binary string; there are therefore 2^N possible strings, and the best-performing string needs to be selected. When N is small enough, the set dueling principle can be used to select the best-performing string. The problem, however, is that the number of set dueling monitors needed increases exponentially with the number of competing applications running concurrently. Therefore, the authors propose two scalable approaches that solve this problem. The exponential search space is reduced to a linear one by learning the best insertion decision for each application individually, filtering out applications that do not benefit from the cache using BIP. TADIP comes in two main flavours: TADIP-I and TADIP-F, the latter improving upon the former. TADIP-I stands for TADIP-Isolated.
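The single-thread DIP building block that TADIP extends can be sketched as follows. This is an illustrative toy, not the hardware design: the set sizes, the choice of leader sets, the epsilon value, and the class name are all assumptions.

```python
import random

class DuelingCache:
    """Sketch of DIP-style set dueling for one thread: a few leader sets
    are dedicated to LRU, a few to BIP, and a saturating PSEL counter
    picks the policy used by all follower sets."""

    BIP_EPSILON = 1 / 32          # fraction of BIP fills placed at MRU

    def __init__(self, num_sets=32, ways=4, psel_bits=10):
        self.sets = [[] for _ in range(num_sets)]     # index 0 = LRU end
        self.ways = ways
        self.psel = 1 << (psel_bits - 1)              # start at midpoint
        self.psel_max = (1 << psel_bits) - 1
        # Dedicated sample sets (hypothetical choice of leaders).
        self.leader = {0: 'lru', 1: 'bip', 2: 'lru', 3: 'bip'}

    def _policy(self, set_idx):
        if set_idx in self.leader:
            return self.leader[set_idx]
        # Followers use the winner: PSEL counts LRU-leader misses up
        # and BIP-leader misses down, so its MSB picks the policy.
        return 'bip' if self.psel >= (self.psel_max + 1) // 2 else 'lru'

    def access(self, set_idx, tag):
        s = self.sets[set_idx]
        if tag in s:                                  # hit: promote to MRU
            s.remove(tag)
            s.append(tag)
            return True
        if self.leader.get(set_idx) == 'lru':         # duel bookkeeping
            self.psel = min(self.psel + 1, self.psel_max)
        elif self.leader.get(set_idx) == 'bip':
            self.psel = max(self.psel - 1, 0)
        if len(s) >= self.ways:
            s.pop(0)                                  # evict the LRU line
        if self._policy(set_idx) == 'bip' and random.random() >= self.BIP_EPSILON:
            s.insert(0, tag)                          # BIP: insert at LRU position
        else:
            s.append(tag)                             # LRU policy: insert at MRU
        return False
```

TADIP-I generalizes this by running one such duel per application, each with its own PSEL counter.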
It tries to learn the insertion decision of each competing application in isolation. N+1 Set Dueling Monitors (SDMs) are used for this purpose for the N applications competing for the same shared cache. Each SDM learns the insertion policy of one application in an independent manner. The baseline SDM (i.e., the first SDM) uses a fixed policy, namely the traditional LRU policy, for every application. The remaining N SDMs (called bimodal SDMs) use BIP for one application and the LRU policy for the others. Figure 3.5 (b) depicts the TADIP-I scheme for a cache shared by 4 applications. Given a binary string <P0,P1,P2,P3>, the insertion policy for Application 0 is P0, for Application 1 is P1, and so on. The bimodal insertion policy (BIP) is used when Px is 1; otherwise the LRU policy is used. Px is the MSB of a policy
selection (PSEL) counter. Both TADIP schemes require a per-core PSEL counter. As depicted in the figure, the baseline SDM uses the binary string <0,0,0,0>, which indicates that all applications use the LRU policy. The four bimodal SDMs use the binary strings <1,0,0,0>, <0,1,0,0>, <0,0,1,0>, and <0,0,0,1>, respectively. Figure 3.5: Adaptive Insertion Managed Shared Caches. Three schemes for managing a cache shared by 4 applications. (a) DIP (b) TADIP-Isolation (c) TADIP-Feedback. Set Dueling Monitors (SDMs) estimate misses for a given policy and follower sets use the best performing policy. The PSEL counter associated with each application is used to select the best insertion policy for that application. The PSEL counters of all applications are incremented whenever the baseline SDM encounters a miss; a bimodal SDM, on the other hand, decrements only the associated application's PSEL counter on a miss. TADIP-I has one major problem, however: the insertion decision of an application might depend on another application's insertion decision. To avoid this problem in TADIP-I, TADIP-F, i.e., TADIP with Feedback, is introduced by the authors. To learn the winning insertion policy, TADIP-F uses 2N SDMs in total, two per application. Figure 3.5 (c) depicts the TADIP-F scheme for four applications sharing one cache. For four applications, eight SDMs are used, two per application. (ii) UCP Utility-Based Cache Partitioning (UCP) [49] is a low-overhead, high-performance partitioning scheme for shared caches. UCP is based on the idea that the utility of cache space varies widely across applications. If
two applications with low utility execute together, their performance is not sensitive to the amount of cache available to each application. On the other hand, if two applications executing together have saturating utility, the cache is capable of supporting the needs of both applications. However, when two applications with low and high utility execute together, it is possible that the working set of the high-utility application is not kept in the cache. Therefore, it is important to partition the cache among applications by taking into account the utility of each application. The authors provide this by quantitatively defining cache utility for a single application on the basis of the number of cache ways allocated. For an application that incurs missa misses with a ways and missb misses with b ways, the utility U of increasing the number of ways from a to b is defined as follows:

U(a to b) = missa - missb

Figure 3.6: Framework for Utility-Based Cache Partitioning. Figure 3.6 depicts the proposed UCP framework. The figure shows two applications that compete for the shared L2 cache in a dual-core system, each running on a separate core. A utility monitor circuit (UMON) is dedicated to each core for the purpose of monitoring the application running on that core. To allow the UMON circuit to obtain utility information for all the ways in the cache, it is implemented separately from the shared cache. Based on the collected utility information, the partitioning algorithm decides the number of ways to allocate to each core. To do the utility monitoring effectively, there needs to be a method for monitoring the number of misses for all possible way allocations. For example, for an 8-way cache, the UMON should track misses for all 8 possibilities, from only one way allocated to an application up to all 8 ways. The brute-force approach is to maintain 8 tag directories, each with the same number of sets as the shared cache but with a different number of ways. The high
hardware overhead of multiple directories makes this approach impractical. However, the baseline LRU policy obeys the stack property, i.e., an access that hits in an n-way cache also hits in a cache with more than n ways. Therefore it is possible to compute hit and miss information for all possible way counts with a single tag directory. For an n-core system, n UMON circuits are needed, one per core. To reduce the hardware overhead of UMON, the authors use Dynamic Set Sampling (DSS), which makes it possible to approximate the behaviour of the cache using only a few sets. The hit counter information in UMON is approximated using DSS. (iii) Dynamic partitioning of shared cache memory [48] presents a dynamic cache partitioning algorithm which allocates cache among simultaneously executing processes such that overall cache misses are reduced. This scheme dynamically estimates each process's gain or loss, in terms of the number of cache misses, under different cache allocations using a set of online counters. Based on this estimate, the cache allocation to processes is dynamically changed so that processes with higher losses are allocated more cache space. For N concurrent processes competing for a shared cache of C blocks, with partitioning on a block basis, the problem is to partition the cache into N disjoint subsets of cache blocks in such a way that the number of misses is minimized. The partition is fixed over a given amount of time T. Given that ci is the number of cache blocks allocated to the ith process over T, the cache partitioning among processes is specified by the number of cache blocks allocated to each process: {c1, c2, c3, ..., cN}. Also, given that mi(c) is the number of cache misses for the ith process over T as a function of partition size, the optimal partition for time period T is the allocation {c1, c2, c3, ..., cN} that minimizes the total misses over T:

minimize  m1(c1) + m2(c2) + ... + mN(cN),  subject to  c1 + c2 + ... + cN = C

where C is the total number of cache blocks. The authors define the marginal gain of a process, gi(c), as:

gi(c) = mi(c) - mi(c + 1)
which is simply the (negative) slope of the miss curve mi(c) at a given cache allocation c. This marginal gain represents the number of cache misses that can be eliminated using one extra cache block; it therefore indicates the benefit of increasing the cache allocation of a process from c to c + 1 blocks. These marginal gains need to be calculated online for different cache sizes. The authors use a set of counters for this purpose. For a fully associative cache with C blocks, it is possible to compute g(c) over a time period T online using C counters: the marginal gain g(c) is obtained directly by counting the number of hits to the (c+1)th most recently used block. 3.2.2. Contention-Aware Scheduling Strategies (i) Cache-Aware Scheduling for Multicores [50] presents a scheduling strategy for competing tasks on multiple cores with timing and cache allocation constraints. The advantage of this method is that it allows each task to use a fixed number of cache partitions, so that a cache partition is used by only one task at a given time. Therefore, the cache space allocated to a task is isolated. The authors assume that there exist cache partitioning algorithms which can divide the shared cache space into non-overlapping partitions that tasks can use independently, as shown in Figure 3.7 (a). Figure 3.7: Cache space isolation and page coloring For a multicore platform with M cores and A cache partitions, and a set τ of independent sporadic tasks whose numbers of cache partitions and WCETs are known for the platform, the task model is defined as
follows: A task is defined as τi = (Ai, Ci, Di, Ti), where Ai is the cache space size, Ci is the worst-case execution time (WCET), Di <= Ti is the relative deadline for each release, and Ti is the minimum inter-arrival separation time, also referred to as the period of the task. The authors assume the tasks are ordered by priority. The utilization of a task is defined as ui = Ci/Ti, and the slack as Di - Ci; the slack is the maximum delay allowed before missing a deadline. The authors focus on the much simpler non-preemptive fixed-priority scheduling (FPCA), as it is difficult to predict the overhead each task incurs due to preemption. The scheduling algorithm is triggered by a job completion or a job arrival. Given that there are enough resources available, the highest-priority jobs are scheduled for execution. More specifically, a job Ji is scheduled for execution if the following conditions hold: 1. Ji is the job of highest priority among all waiting jobs, 2. There is at least one idle core, and 3. Enough cache partitions, i.e., at least Ai, are idle. There is at most one active job per task due to the assumption Di <= Ti. Figure 3.8: Example for illustrating the scheduling algorithm An example task set is shown in Figure 3.8. Table 1 in the figure shows the tasks as scheduled by the algorithm. At time 0, a ready low-priority job cannot execute due to the conditions above: even though it is ready to execute, and a
free cache partition is available, a higher-priority job needs to execute first; however, there are not enough idle cache partitions available for this high-priority job to run. This is a limitation of the algorithm: a low-priority job is not scheduled even when enough resources are available for it, because a high-priority job must wait, thus wasting valuable resources. This type of scheduling is called blocking-style scheduling. In contrast, a scheduling policy that allows a low-priority ready job to execute ahead of a high-priority ready job that does not have enough idle cache space is called a non-blocking-style schedule. (ii) Cache-Aware Multicore Real-Time Scheduler [51] describes a cache-aware soft real-time scheduler that reduces the cache miss rate. The authors assume the system is modeled as a set of multi-threaded tasks (MTTs). Each MTT consists of a set of sequential tasks with a common period; the tasks may have different execution costs. MTTs are used to specify groups of cooperating tasks referencing a common set of data. MTTs allow concurrency within task models that typically handle only the sequential execution of tasks. The processing power of each core is likely to remain the same as per-chip core counts increase; therefore, MTTs should be useful for achieving performance gains. The authors use G-EDF scheduling as a baseline for evaluating the performance of the proposed cache-aware scheduler. In G-EDF scheduling, jobs are scheduled in order of increasing deadlines, with ties broken arbitrarily. G-EDF is not an optimal scheduling policy, so tasks may miss their deadlines; however, research has shown that this tardiness is bounded. G-EDF is used as the scheduling heuristic in this system. Here, the per-job working set size (WSS) of each MTT indicates its cache impact. The authors also provide a profiler that gives the per-job WSS for each MTT, which is used by the scheduling heuristic.
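The blocking-style admission test can be sketched in a few lines. This is an illustrative sketch only; the tuple layout and function name are assumptions, not the paper's notation.

```python
def pick_next_job(waiting, idle_cores, idle_partitions):
    """Blocking-style check of the three conditions above: the
    highest-priority waiting job runs only if at least one core is idle
    and at least A_i cache partitions are idle. `waiting` is a list of
    (priority, needed_partitions) tuples; lower number = higher priority."""
    if not waiting or idle_cores == 0:
        return None                              # condition 2 fails
    top = min(waiting, key=lambda job: job[0])   # condition 1: highest priority
    _, needed = top
    if needed <= idle_partitions:                # condition 3: enough partitions
        return top
    return None      # blocking style: lower-priority jobs also wait
```

Note the blocking behaviour: when the top-priority job's partition demand cannot be met, the function returns None even if a lower-priority job could fit, which is exactly the resource waste the non-blocking variant avoids.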
The reason for profiling MTTs rather than individual tasks is that the tasks of an MTT share a common working set. This profiling happens online while jobs are executing. The profiler uses performance counters to record MTT behavior. Each core is associated with a set of performance counters that can be used to monitor events originating from the core. The authors set each counter to track shared cache misses. Jobs are executed sequentially, which makes it possible to track the number of cache misses by resetting the counters to zero at the
beginning of execution. The misses are read at the end of the execution of the job sequence. It is these misses that the authors use to calculate the per-job WSS. (iii) Symbiotic Job Scheduler (SOS) [52] introduces a job scheduler called SOS for Simultaneous Multi-Threading (SMT) architectures. This work is based on the observation that performance on a hardware SMT processor is sensitive to the set of jobs that are co-scheduled by the operating system scheduler, and that to get the benefit of an SMT environment the scheduler must be able to identify the interactions between competing threads. The SOS scheduler has been shown experimentally to improve the performance of SMT architectures significantly. One significant advantage of this method is that SOS does not assume any prior knowledge of workload characteristics. Instead, sampling techniques are used to identify threads that minimize contention for shared resources. The SOS scheduler runs jobs in groups whose size matches the SMT level. The jobs are grouped based on a selected fairness policy that allows all jobs to make progress. The SOS scheduler first enters a sample phase, during which it permutes the schedule periodically; this changes which jobs are co-scheduled into groups. During this sample phase, the SOS scheduler collects dynamic execution profiles of the executing jobs using the hardware performance counters. After sampling the performance of several schedule permutations, SOS selects the best-performing schedule and runs it for the rest of the time, until the jobs are completed. The authors also define a measure of the goodness (i.e., speedup) of a co-schedule. Intuitively, if one job schedule executes more useful instructions than another during the same time interval, the first job schedule is considered more symbiotic and shows higher speedup. This suggests that IPC (instructions per cycle) is a good measure of speedup.
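The select-the-best-sample step can be sketched as follows. The sketch normalizes each job's co-scheduled IPC by the IPC it would achieve running alone, so that a schedule cannot look good merely by favoring naturally high-IPC threads; the function names and input layout are hypothetical.

```python
def weighted_speedup(coscheduled_ipc, alone_ipc):
    """Symbiosis measure in the spirit of the description above: each
    job's progress is normalized by the rate it would achieve running
    alone, so high-IPC jobs cannot dominate the metric."""
    return sum(ipc / alone for ipc, alone in zip(coscheduled_ipc, alone_ipc))

def best_schedule(samples):
    """Pick the sampled co-schedule with the highest weighted speedup.
    `samples` maps a schedule name to (co-scheduled IPCs, alone IPCs)."""
    return max(samples, key=lambda s: weighted_speedup(*samples[s]))
```

For example, a schedule in which two jobs each run at half their stand-alone rate scores 1.0, i.e., no symbiotic gain over running them back to back.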
However, the problem with this approach is that an unfair schedule can show good speedup, at least for a while, by favoring high-IPC threads. Therefore, to ensure that SOS is measuring real increases in the rate of progress through the entire job mix, the authors define the following measure:

Weighted Speedup(t) = sum over all threads of WS(t)
where WS(t) is the contribution of each thread to the total work completed in the interval, obtained by dividing the instructions executed on each job's behalf by its natural offer rate if run alone. 3.3. Power Management in Multicores Today, many embedded systems such as mobile phones and PDAs are designed using complex multicore SoC platforms. For such multicore platforms, suitable power management techniques need to be devised. The simplest method of power managing a multicore chip is to apply well-known single-core techniques to every core. The problem with this approach, however, is that it is inefficient because it cannot take advantage of peak-power averaging effects that occur across multiple cores. Therefore, much research has been conducted in this area [53, 54, 55] to provide improved power management techniques. Here we discuss some important such techniques for power management in multicores. Figure 3.9: Real temperature of one core running the bzip2 benchmark [53] (i) Predictive Dynamic Thermal Management for Multicore Systems (PDTM) [53] describes a Predictive Dynamic Thermal Management (PDTM) scheme based on an Application-Based Thermal Model (ABTM) and a Core-Based Thermal Model (CBTM) in multicore systems. For each core, PDTM uses an advanced future-temperature prediction model to estimate the thermal behavior. This temperature
prediction model makes use of both core temperature and application temperature variations. Based on the predictions, appropriate actions are taken to avoid thermal emergencies. The overall prediction model combines ABTM and CBTM: using the application's thermal behaviour, ABTM predicts future temperature, while CBTM uses steady-state temperature and workload to estimate the core temperature pattern. This temperature prediction model, together with the thermal-aware scheduling mechanism, has been implemented by the authors on a real four-core product under a Linux environment. The experimental results on Intel's quad-core system running two SPEC2006 benchmarks simultaneously show that the proposed PDTM lowers the average temperature by about 5% and reduces peak temperature by up to 3%, with at most an 8% performance overhead. Figure 3.9 depicts the rapid temperature changes even when the workload is steady at 100%. To predict future temperature at fine granularity, ABTM uses short-term thermal behavior. ABTM first derives the thermal behavior from local intervals (i.e., short-term temperature reactions) and then predicts the future temperature by incorporating this behavior into a regression-based approach known as the recursive least squares method. In the general least-squares problem, the output of a linear model y is given by the linear parameterized expression:

y = θ1 f1(u) + θ2 f2(u) + ... + θn fn(u)

where u = [u1, u2, ..., un] is the model's input vector, f1, ..., fn are known functions of u, and θ1, θ2, ..., θn are unknown parameters to be estimated. Using this equation, ABTM can predict the future temperature of an application, as shown in Figure 3.10. Figure 3.10: Calculation of t (migration time) using ABTM
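The least-squares prediction idea can be illustrated with the simplest instance of the model, f1(u) = 1 and f2(u) = t, fitted to recent temperature samples. This is a batch stand-in for the recursive update ABTM uses, written for illustration only; the function names are hypothetical.

```python
def fit_linear_trend(times, temps):
    """Ordinary least-squares fit of y = theta0 + theta1 * t to recent
    samples, via the closed-form normal equations for a line."""
    n = len(times)
    st, sy = sum(times), sum(temps)
    stt = sum(t * t for t in times)
    sty = sum(t * y for t, y in zip(times, temps))
    theta1 = (n * sty - st * sy) / (n * stt - st * st)  # slope
    theta0 = (sy - theta1 * st) / n                     # intercept
    return theta0, theta1

def predict_temp(times, temps, t_future):
    # Extrapolate the fitted trend to a future time step.
    theta0, theta1 = fit_linear_trend(times, temps)
    return theta0 + theta1 * t_future
```

A recursive least-squares implementation would update theta incrementally per sample instead of refitting, but produces the same estimates for this model.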
(ii) Dynamic Multicore Power Management (DPM) To address the problem of dynamic power management, the authors of [54] propose a formal verification model of a DPM scheme. This model uses probabilistic model checking to estimate the required verification effort, and the model checker is capable of providing information on how certain design parameters impact this effort. Figure 3.11 depicts the current industrial workflow for developing a DPM scheme; the unshaded portion shows the development of the new DPM. This workflow, however, is prone to missing bugs. The new method addresses this concern by introducing an additional, early step in the development of a new DPM scheme, shown as the shaded portion in Figure 3.11. The purpose of this additional step is to create a high-level model of the proposed power management policy at an early design stage. The probabilistic model checker is then used to verify this high-level model for efficiency and safety. Probabilistic model checking is an exhaustive formal verification method. The advantage of performing high-level verification early in the development process is that problems can be identified at a stage when they are easier to solve. Also, a high-level model is much easier to develop and modify than a detailed simulator, and various designs can be verified quickly. Figure 3.11: Workflow for DPM Scheme
The effort required to verify the DPM scheme is measured as the number of reachable states and transitions, estimated using the model checker. The model checker provides a better understanding of the impact on verification effort of scaling certain design parameters. It should be noted, however, that model checking does not eliminate the need to later simulate a detailed implementation of the DPM scheme; rather, it can catch bugs early and help the simulation reach the desired state coverage goals.
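The state-and-transition count that quantifies verification effort amounts to an exhaustive reachability exploration. A minimal sketch over a toy, non-probabilistic power-state model (the state names and transition table are invented for illustration):

```python
from collections import deque

def count_reachable(initial, transitions):
    """Breadth-first exploration of a toy power-state model, counting the
    reachable states and transitions that determine verification effort."""
    seen, edges = {initial}, 0
    frontier = deque([initial])
    while frontier:
        state = frontier.popleft()
        for nxt in transitions.get(state, ()):
            edges += 1                     # every explored transition counts
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen), edges

# Hypothetical per-core power states and allowed moves between them.
model = {
    'RUN':   ['IDLE'],
    'IDLE':  ['RUN', 'SLEEP'],
    'SLEEP': ['IDLE'],
}
states, trans = count_reachable('RUN', model)
```

A real probabilistic model checker additionally attaches probabilities to transitions and verifies quantitative properties, but the state-space growth it must cope with is of this reachability kind.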
4. PARALLELIZATION WITH MULTICORES 4.1. Background Parallelization is the most significant feature of multicore and is also the motivation behind the development of multicore processors. In the evolution of high-performance multicore processors, multithreaded CPUs were the stepping stone. In a multithreaded system, hardware-level context switching between threads is used to reduce the idle time of the processor. Shortly after, designers integrated more than one processor core onto a single chip. Eight-core processors are common these days, with forecasts for CPUs with more and more cores becoming available in the near future. Assuming that Moore's Law holds, we expect a doubling of the number of cores on chip every two years, putting many-core CPUs (16 or more cores) just over the horizon. Multithreaded and multicore CPUs both exploit concurrency by executing multiple threads, though their designs target different objectives. Multithreaded CPUs support concurrent thread execution by issuing instructions from multiple threads. Multicore CPUs achieve thread concurrency and scalability by replicating cores; such CPUs are often called CMPs, for Chip Multi-Processing. Most recent CPU and GPU (Graphics Processing Unit) designs, like the Sun UltraSPARC T2, IBM POWER6, Intel Xeon, ATI RV770, and NVIDIA GT200, combine both of these design options and have multiple multithreaded cores [24]. 4.2. Design spectrum of parallelization The performance of these parallel systems depends on a wide range of design choices, as given below [24]: Multithreaded Cores All multithreaded cores keep multiple hardware threads on-chip, ready for execution. This is necessary to minimize context-switching cost. Each on-chip thread needs its own state components, such as the instruction pointer and other control registers.
Thus, the number of on-chip threads determines the number of state components that must be replicated and, subsequently, the maximum degree of hardware-supported concurrency.
There exist a variety of approaches for switching between threads per core, ranging from alternating between threads to issuing instructions from several threads each cycle. The former is called Temporal Multi-Threading (TMT). Most current CPUs employ the latter approach, called Simultaneous Multi-Threading (SMT); one of the most common examples of SMT is Intel's Hyper-Threading Technology (HTT). Multicore CPUs Hardware multithreading per core has limited scalability, whereas multicore CPUs are more promising in this respect. Most early multicore chips were constructed as a simple pairing of existing single-core chips, as in the dual-core Itanium [24]. Like their predecessors, these chips replicate only the control and execution units and share the remaining units per chip. However, sharing has disadvantages regarding contention for the shared resources. The current trend is toward replicating more components, such as memory controllers and cache, to reduce contention and communication overhead. Integration of on-chip components The number and selection of integrated components on chip is an important design decision. Possible components to include on-chip are memory controllers, communication interfaces, and memory. Placing the memory controller on-chip increases bandwidth and decreases latency. Some designs allow multiple memory controllers to be integrated to make memory-access bandwidth scale with the number of cores. Integrating a GPU core on chip is another promising technique and might become common in next-generation multicores. IBM's Blue Gene/P [25] system relies on a highly integrated system-on-a-chip design which features four cores, five network interfaces, two memory controllers, and 8MB of L3 cache. Because each Blue Gene compute node is so highly integrated, the system scales to hundreds of thousands of processors. Shared vs.
Private Caches Aside from concurrency, caches are the most important feature of modern CPUs for enhancing performance, due to the gap between CPU speed and memory-access times. For multicores, the organization of cache memory also plays an important role. Most current multicore-chip designs have a private L1 cache per core to reduce the amount of contention for this critical cache level. The assignment of L2 cache in multicore designs varies. L3 cache was historically off-chip and shared, with some exceptions. Whether
shared or private caches are more beneficial depends to a great extent on the application characteristics. Shared caches are important if threads of the same application are executing on multiple cores and share a significant amount of data. In this case, a shared cache is more economical, as it avoids multiple copies of data and cache-to-cache transfers. However, shared caches can impose high demands on the interconnect [26]. Fault Tolerance Dynamic partitioning of cache or other resources can be extended to deal with hardware faults, which are more likely to occur with higher circuit density. Faults can result from electrical noise or minor permanent defects in silicon, potentially spreading from individual components to failure of the entire chip. Some CPUs disable faulty cores at fabrication time to increase yield. Additionally, fault tolerance may comprise partitioning of replicated and separable units, such as multiple inter-chip interconnects and memory controllers, in addition to multiple cache banks. This leads to an increase in overall availability and provides graceful performance degradation in case of faults. Interconnect Another important feature with an impact on the performance of multicore chips is the communication among different on-chip components: cores, caches, and, if integrated, memory controllers and network controllers. Initial designs used a bus, as in traditional multi-CPU systems. The trend now is to use a crossbar or other advanced mechanisms to reduce latency and contention. For instance, AMD CPUs employ a crossbar, and the Tilera TILE64 implements a fast non-blocking multi-link mesh. However, the interconnect can become expensive: an 8 x 8 on-chip crossbar can consume as much area as five cores and as much power as two cores [26]. If only private caches are present on-chip, data exchange between threads running on different cores uses the off-chip interconnect.
Therefore, introducing a level of shared cache on chip, or supporting data-exchange short-cuts such as cache-to-cache transfers, helps to decrease the off-chip traffic [3]. However, more on-chip cache levels put additional requirements on the on-chip interconnect. As data processing increases with more thread-level parallelism, demands also typically increase on the off-chip communication fabric for memory accesses, I/O, or CPU-to-CPU communication. To address these requirements, the new trend in off-chip communication is packet-based, point-to-point interconnects.
Core Complexity vs. Number of Cores Traditionally, CPUs used out-of-order execution, dynamic branch prediction, and longer pipelines to optimize single-thread performance. With the introduction of thread-level parallelism, CPU designs can become less complex and dedicate more circuitry to concurrency. Other CPUs, on the other hand, have chosen more complex designs to provide better performance for a single core. This is an important design decision and depends on many factors. An important consideration is that chip concurrency cannot be exploited by serial programs. Even in parallel programs, some parts of the algorithm must run sequentially, and by Amdahl's Law we know that the maximum speedup of an algorithm is determined by the fraction of it that is sequential: with a parallel fraction f executed on n cores, Speedup(f, n) = 1 / ((1 - f) + f/n). Amdahl's law can be extended for multicore systems [67] to deduce the speedup for symmetric multicore chips as Speedup_symmetric(f, n, r) = 1 / ((1 - f)/perf(r) + f*r/(perf(r)*n)), for asymmetric multicore chips as Speedup_asymmetric(f, n, r) = 1 / ((1 - f)/perf(r) + f/(perf(r) + n - r)), and for dynamic multicore chips as Speedup_dynamic(f, n, r) = 1 / ((1 - f)/perf(r) + f/n), where the chip has n base-core-equivalent resources, each core is built from r of them, and perf(r) is the performance of such a core. Amdahl's law results and other considerations yield a significant conclusion about balancing core complexity and the number of cores: larger numbers of simple cores are better for applications with a small serial part; otherwise, more complex cores are better [28]. A larger number of simple cores may also cause greater demand on the interconnect and in turn make complex cores more beneficial, and a large number of cores can have limited scalability due to synchronization overhead or load imbalance. Parallel applications that are floating-point intensive or highly instruction-parallel may experience higher returns from complex cores. 4.3. Types of Parallelism Current multicore processors exploit a wide range of parallelism techniques. The choice between these techniques depends both on application characteristics and hardware architecture. The categories are given below:
4.3.1 Task Parallelism: A task is broken down into multiple independent subtasks, which can run in parallel on different threads or on different cores [30]. Figure 4.1 Task-level parallelism 4.3.2 Data Parallelism: In data parallelism, a big chunk of data is broken down into smaller datasets so that they can be operated on in parallel. After the data has been processed, it is combined into a single result [31]. Figure 4.2 Data Parallelism 4.3.3 Pipeline Parallelism: This is a strategy that divides a serial task into multiple stages and runs the stages in parallel, like an assembly line, for performance gain [32]. Figure 4.3 Pipeline Parallelism
4.3.4 Structured Grid Parallelism: In a structured grid, the data is arranged in a multidimensional array or grid. A stencil function computes a new element value from the current element value and its neighboring cells. Typically many iterations of a stencil function are performed, producing a series of data, until some convergence condition is reached [33]. Structured grid programs are highly data-parallel. Figure 4.4 Structured Grids 4.4 Multicore Programming Platform Multicore processors require sophisticated parallel programming platforms along with complicated hardware structures to achieve the desired level of parallel performance. Parallel programming languages were introduced long ago, and Message Passing Interface (MPI) and Open Multiprocessing (OpenMP) were the most dominant in the last decade [34]. Today a large number of parallel programming languages are available which provide parallel programming features on the current variety of multicore processors. Some of these languages are extensions of a general sequential programming language; e.g., Unified Parallel C (UPC) and Sequoia are two languages which are parallel extensions of traditional C. Some of the popular parallel programming languages are discussed below: CILK CILK, developed at MIT and later commercialized by Cilk Arts, is an algorithmic multithreaded language [35]. The programmer's job is to exploit locality and expose parallelism; the responsibility of the runtime system is scheduling the applications, load balancing, and communication between the parallel modules. The basic CILK language is an extension of C with three keywords to support parallelization. As we can see in Fig. 4.5, it is very similar to C, having only three extra keywords: CILK, SPAWN, and SYNC.
CILK is used to identify the parallel procedures. SPAWN is used to create a child procedure; any number of child procedures can be created to increase the degree of parallelism, and a child can in turn spawn children of its own, allowing recursion. SYNC is used for synchronization between parent and child procedures: at a SYNC, a parent procedure is forced to wait until all of its children finish execution. Another interesting feature of the CILK language is work stealing [44]. During the execution of a Cilk program, when a processor runs out of work, it asks another processor, chosen at random, for work. When a procedure spawns a child, it saves its programming context and local variables at one end of a stack and proceeds with the child. When another processor wants work, it pops from the other end the stored work that has been waiting the longest. Figure 4.5 Sample Fibonacci program (a) sequential C program, (b) parallel CILK program
CUDA CUDA was developed by NVIDIA and was the first mainstream GPGPU programming technology. The unique feature of the programming model is the CUDA thread grid. The programming unit is a function called a kernel. Kernels are run by thread blocks; the thread blocks are independent of each other and run in parallel, while the threads within one thread block share memory space [36]. These thread blocks are organized into multidimensional grids. CUDA supports data parallelism by running parallel kernels on one or more thread blocks, and it supports task parallelism by executing different kernels in different thread blocks, since thread blocks execute independently of each other. Fig. 4.6 CUDA thread blocks Intel TBB Intel TBB (Threading Building Blocks) is a C++ template library that extends traditional C++ to provide parallel programming features on multicore systems [37]. The library consists of data structures and algorithms that avoid the scenario where individual threads of execution are created, synchronized, and terminated manually. Instead, the TBB library abstracts the operations to be performed as "tasks," which are allocated to individual cores dynamically by the library's run-time engine, which also automates efficient use of the cache. TBB employs a task-stealing concept very similar to CILK's work-stealing feature: tasks are divided uniformly across all the cores, and if a core finishes early while others have substantial work left, TBB reassigns some work from a busy core to an idle core. The TBB library is a collection of components for parallel programming, shown in Fig. 4.7.
Figure 4.7 Intel TBB libraries OpenMP OpenMP is an Application Program Interface (API) which provides a portable and scalable model for developers of shared-memory parallel applications. The API supports C/C++ and FORTRAN on multiple architectures, including UNIX and Windows NT. The OpenMP programming model provides shared-memory, thread-based parallelism, where multiple threads exist in a shared-memory paradigm. It is explicitly parallel, giving the programmer full control over the parallelization. Every OpenMP application starts with only one thread, the master thread. The master thread runs sequentially until it enters a parallel region; it then creates a team of parallel threads via a fork. When the team's threads complete the statements in the parallel-region construct, they synchronize and terminate, leaving only the master thread. The OpenMP API also provides features such as nested parallelization and dynamic threads. SWARM (Software and Algorithms for Running on Multi-core) The rapid growth in the parallelism and performance of multicore systems suggests that parallel programming techniques alone are not sufficient to keep pace with hardware growth. Exploiting concurrency at the algorithmic level is a further step toward even better performance. SWARM is an open-source, portable parallel programming framework [68] which aims to simplify algorithm design on multicore systems. This library supplies basic functionality for multithreaded programming, such as synchronization, control, and memory management, to implement parallel algorithms successfully. A user
needs to make only minimal changes to his sequential code to parallelize it. As a first step, the compute-intensive parts are identified, and those parts are then assigned to multiple cores to exploit thread and loop parallelism. SWARM provides predefined primitives for data parallelism, control between threads, and memory management. XJava XJava was developed as an extension to an object-oriented programming language to support parallelism. It combines three concepts: stream languages, parallel design patterns, and object orientation. The central construct of XJava is the task. Tasks are similar to method declarations; they can be declared in classes and interfaces and can inherit other tasks [70]. A task runs as a separate thread and has an input and an output port: it receives data at its input port and generates data at its output port. Tasks can be categorized into two subtypes, periodic and non-periodic tasks [70]. A periodic task describes exactly one work block inside its task body. Non-periodic tasks, in contrast, do not have a work block; they may contain parallel statements. Individual tasks can be combined using special operators into parallel statements. The => operator connects tasks via their input/output ports to form a pipe statement, while a second operator joins parallel executable tasks into a concurrent statement.
5. MEASURING MULTICORE PERFORMANCE Even though most computer scientists are familiar with the gains from multicores, many embedded-system designers are still struggling to determine whether multicore really buys them anything in terms of performance [56]. To resolve this, one must get a thorough understanding of the target application, the characteristics of the multicore processors that could be used, and the amount of time that must be invested to make the transition. Complicating the problem, many multicore products are on the market in the form of application-specific systems on a chip (SoCs). These can be as simple as a processor with a general computing core plus a digital signal processing core. 5.1. Traditional Benchmark Methods This section provides a brief history of traditional benchmarking methods. Before multicore or SMT processors were developed, benchmarks that modeled a single processor core's internal workings were sufficient. These can still be used, depending on the processor characteristics that are modeled; most of the EEMBC first-generation benchmarks fall into this category. However, one of the major problems with these traditional benchmarks is that they exercise a single processor core and have little interaction with external memory. Many such benchmarks can run on top of most operating systems, and performance is measured in iterations per second. In this type of benchmarking, the compiler also plays a big role: [56] notes that performance differences of as much as 70 percent have been observed depending on the compiler used to generate the benchmark results. 5.2. Multicore Benchmark Criteria The multicore benchmark criteria depend greatly on the multicore performance characteristics: A. Memory bandwidth Memory bandwidth is a major feature that influences performance, regardless of the type of multicore architecture.
A multicore processor's memory bandwidth depends on the memory subsystem's design, which in turn depends on the underlying multicore architecture. The shared memory in multicore systems is accessed through a bus and controlled by some type of locking mechanism to avoid simultaneous
access by multiple cores. This provides a straightforward programming model because each processor can directly access the memory. B. Scalability Another important benchmark criterion is scalability. If an application oversubscribes the processor's computing resources, it incurs performance penalties. For example, scalability should be good with respect to the number of threads used in an application program; the number of threads could be in the hundreds for a relatively complex program. In an ideal case where the number of threads exactly matches the number of processor cores, performance could scale linearly, assuming no limitations on memory bandwidth. However, this is not realistic: it is often the case that the number of threads will exceed the number of cores. Therefore, performance will depend on other factors such as cache utilization, memory and I/O bandwidth, inter-core communication, OS scheduling support, and synchronization efficiency. 5.3. SMP-based Multicore Benchmarks One of the major benefits of multicores is that they can exploit parallelism to improve the performance of individual tasks, not just overall throughput. To achieve the advantage of parallelization, one should use benchmarks that utilize task decomposition, functional decomposition, or data decomposition; to exercise the major multicore benchmark criteria, such decomposition methods must be supported by a benchmark. While many techniques and profiling tools are available for measuring performance on homogeneous multicore platforms, most of them depend on hardware support from the vendors. For developing applications on heterogeneous multicore systems, very few analysis tools exist to help developers. Here we describe some such popular benchmarks: (i) Parallel Tracer [57] introduces a profiling toolkit for multicores called Parallel Tracer.
This consists of a software-based trace collection and performance analysis framework that can be ported to a variety of platforms via code instrumentation at the source level, together with a pure software profiling toolkit. ParallelTracer is implemented
based on ANTLR, a popular open-source parser generator. The IBM Cell processor is used as a case study to demonstrate the capability of ParallelTracer. ParallelTracer is capable of providing useful information for programmers to understand program behaviors, and it can also identify potential performance bottlenecks via graphical visualization. To allow tracing across numerous embedded platforms, Trace Collection and Trace Processing (TCPP) [66] was initially developed. TCPP focused on providing a portable source-code instrumentation technique that inserts performance-monitoring code offline. TCPP collects event traces via the injected code and performs post-mortem performance visualization. TCPP is quite useful for embedded applications; the problem was that it did not support parallel applications. Parallel Tracer extends TCPP by allowing analysis of parallel applications that run on heterogeneous multicore systems. In addition, Parallel Tracer can perform instrumentation, trace collection, and trace visualization across multicore platforms. Let us see how the trace collection mechanism works in Parallel Tracer. Parallel Tracer is equipped with two sets of trace handlers to generate and store the traces on the Cell platform: one set collects trace events into memory buffers, while the other stores the trace events into the file system. When a multithreaded application is executed, the probes injected by the instrumentor in each thread call that thread's own trace handler to collect trace events in parallel with the other threads. The PPE (Power Processing Element) runs the operating system and has direct access to the file system; therefore, Parallel Tracer has a special trace handler running on the PPE side to store the trace records into the file system. Both the PPE and the SPEs (Synergistic Processing Elements) have access to main memory, so it is convenient for the two sets of trace handlers to exchange data through main memory.
Figure 5.1: Trace Collection Architecture Fig. 5.1 depicts the mechanism designed for collecting traces on the Cell processor. It shows the buffering mechanism Parallel Tracer uses to smooth trace collection, which allows communication and computation operations to be overlapped, reducing the execution overhead caused by the tracing tool. Parallel Tracer also provides a visualization component that makes the profile information readable and comprehensible, helping programmers identify performance bottlenecks and improve their programs in a timely manner. (ii) LIKWID [58] presents LIKWID ("Like I Knew What I'm Doing"), a lightweight, performance-oriented tool suite for x86 multicore architectures. It is a set of easy-to-use command-line tools that support optimization in performance-oriented programming environments. LIKWID is developed for the Linux environment, does not require any kernel patching, and is suitable for both Intel and AMD processor architectures. Another advantage of LIKWID is that it supports multithreaded and hybrid shared- and distributed-memory parallel programs. LIKWID comprises the following tools:
1. likwid-features This displays and alters the state of the on-chip hardware prefetching units in Intel x86 processors. 2. likwid-topology This captures the cache topology in multicore and multisocket nodes. It helps to optimize resource usage in parallel code, for example, shared caches and data paths, physical cores, and ccNUMA locality domains. 3. likwid-perfctr This measures performance counters over a given duration, which can be either the complete runtime of an application or the interval between arbitrary points in the code. For the latter case, support from a simple API is needed. To allow the concurrent measurement of a large number of metrics (more than the number of available counters), counter multiplexing is employed. 4. likwid-pin This enforces thread-core affinity in a multithreaded application without changing the source code. It is therefore compatible with all POSIX-threads-based threading models and with hybrid MPI-plus-threads programming. It makes use of likwid-topology to get information on cache topology. Figure 5.2: Thread and cache topology of an Intel Nehalem EP multicore dual-socket node LIKWID only supports x86-based processors. An intimate knowledge of a processor's microarchitecture and of the code characteristics is required for hardware-specific optimization. Profiling is capable of solving many problems; however, sometimes, to get a comprehensive view of the problem, more
information is needed. Performance counters are used for this purpose: they count hardware events during code execution on a processor. Performance counters are implemented directly in hardware; therefore, there is no overhead involved. Almost all processors today provide hardware performance counters. The main purpose of these counters is to support computer architects during the implementation phase, but they still provide an in-depth view of what happens on the processor while running applications. There are two types of hardware performance counters: event counters and overflowing counters. Event counts are collected over the runtime of an application process. Overflowing hardware counters, on the other hand, can generate interrupts and therefore enable a very fine-grained view of a code's resource requirements. In LIKWID, likwid-perfctr uses the first option, as this has zero overhead. Figure 5.3: likwid-perfctr The advantage of likwid-perfctr is that it can be used as a wrapper to an application. Figure 5.3 depicts the likwid-perfctr tool organization. As shown in the figure, it allows simultaneous measurements on multiple cores. Socket locks are used to support events that are shared among multiple cores of a socket; this ensures that all uncore event counts are assigned to one thread per socket. (iii) OProfile OProfile [59] is one of several profiling and performance monitoring tools for Linux. It works on various architectures, including IA32, IA64, and AMD. OProfile is capable of identifying a host of issues, including loop unrolling, poor cache utilization, inefficient type conversion, redundant operations, branch mispredictions, etc. It collects information about processor events including TLB misses, stalls, memory
references, the total lines allocated in the DCU (Data Cache Unit), the number of cycles of a DCU miss, and the number of non-cacheable and cacheable instruction fetches. OProfile can collect samples for a set of instructions, or for function, system-call, or interrupt handlers. OProfile uses sampling techniques; performance problems can be identified using the collected profile data. For this purpose, it consists of a kernel driver and a daemon for collecting sample data, plus several post-profiling tools for turning data into information. OProfile uses hardware performance counters to provide profiles of a wide variety of statistics, which can also be used for basic time-spent profiling. OProfile profiles all code: hardware and software interrupt handlers, kernel modules, the kernel, shared libraries, and applications are all profiled. OProfile has the advantages of post-profile analysis, low overhead, and call-graph support, and it can display information for multicore processors. The following hardware counters can effectively be used for multicore profiling: 1. CPU CLK UNHALTED This increments once every clock period that the CPU is running. 2. DATA CACHE MISSES This increments for every memory reference that misses the L1 data cache. 3. MCT TO XBAR BUFFER FULL CYCLES This increments for every cycle that the memory controller (MCT) to crossbar (XBAR) buffer is full (i.e., can be generated by memory reads). 4. XBAR TO MCT BUFFER FULL CYCLES This increments for every cycle that the XBAR to memory controller (MCT) buffer is full (i.e., can be generated by memory writes). (iv) Servet [60] presents a benchmark suite for auto-tuning multicore clusters. With auto-tuned code, application performance can be optimized automatically depending on the machine on which it is executed. There
already exist tools for sequential computation that use a wide search mechanism to find the most appropriate algorithm; knowledge of some hardware characteristics can reduce the search time. Many optimization techniques exist for parallel computing too. Among the different parallel architectures, clusters of multicores pose significant challenges, as they present a hybrid distributed- and shared-memory architecture with several hierarchies determined by non-uniform communication latencies. Servet is a portable benchmarking suite that one can use to obtain the relevant hardware parameters of clusters of multicores; Servet thereby supports automatic optimization of parallel codes on multicores. The parameters estimated by Servet include cache size and hierarchy, bottlenecks in memory access, and communication overheads. Knowledge of the cache size can effectively be used in many optimization techniques to divide the computation into blocks of data that fit in cache. This minimizes the number of cache misses and, therefore, increases memory-access performance. Figure 5.4: mcalibrator algorithm Figure 5.4 shows the mcalibrator algorithm. Its outputs are two arrays, S and C, of length n, containing the sizes of the traversed arrays and the average number of cycles required by each access during their traversal, respectively.
Memory-access time can be significantly reduced using knowledge of which cores share a particular cache level. If two processes work with the same block of data, which fits in cache, mapping them to cores that share a cache improves performance, because they can exchange data through the cache. On the other hand, if they do not work with the same data, their working sets may not fit in a shared cache, leading to more replacements and misses. In this case, scheduling techniques for auto-tuning would map the processes to cores that do not share a cache, in order to minimize the misses. (v) ParMiBench [62] proposes an open-source parallel benchmark suite called ParMiBench. This benchmark suite is targeted at embedded applications. ParMiBench is a direct extension of MiBench [70], a benchmark for uniprocessor embedded systems. Several MiBench applications are also included in ParMiBench, selected from the following application domains: automotive/industrial control, office, network, and security. Pthreads and standard C are used to implement the parallel versions. The main performance measure in ParMiBench is speedup: the speedup of the parallel algorithms is compared against the sequential algorithms. [71] contains a more detailed description of ParMiBench and performance characterizations of the applications. As mentioned earlier, ParMiBench is for multiprocessor-based embedded systems, and it allows performance to be measured mainly as speedup. Its structure follows EEMBC and MiBench. One of the major advantages of this benchmark is that it is categorized based on applications, which enables users in the embedded-systems market to examine their designs more effectively for a particular market segment of embedded devices. The applications in ParMiBench are capable of running on multiprocessor-based embedded systems.
This is because ParMiBench applications are simply parallel versions of the same applications found in MiBench. Any ParMiBench application can run on Unix/Linux platforms that support Pthreads and C. All ParMiBench applications are compiled with the GNU Compiler Collection (GCC).
In ParMiBench, workers access input data that has been read into memory and then write their results into unique files; writing to buffers reduces input/output communication time. Static load balancing is employed to distribute the work equally among workers, and to reduce synchronization and communication overhead, coarse-grained task decomposition is used. The input data is partitioned among workers based on a fair data-partitioning strategy. Some of the major advantages of ParMiBench include low synchronization and communication overhead and the ability to measure the CPU and memory performance of a system. (vi) JetBench [63] presents JetBench, an open-source benchmark for multicore platforms, targeted at multithreaded applications on shared-memory architectures. JetBench is based on OpenMP and can therefore be seamlessly ported to any platform supporting OpenMP. It is an application benchmark, written in C, for real-time jet-engine thermodynamic calculations. In JetBench, users can specify a custom workload, which could be a real flight profile with deadlines. The benchmark records the time consumed in calculating individual data points and any missed deadlines. The authors claim that JetBench is scalable to any number of cores and that an operating system's scheduling characteristics can be measured using JetBench. The JetBench application contains thermodynamic calculations. As shown in Figure 5.5, the flight profile allows the user to input information such as speed, altitude, throttle, and deadline time; in response, the processing time for the various thermodynamic calculations is monitored and reported. Figure 5.5: JetBench Application I/O parameters JetBench is a realistic representation of the actual workload. The following discusses some deviations that were applied to allow portability of the application across various platforms.
Implementing the I/O requirements of a real application would be too large an undertaking for the benchmark and would lower its portability; as a consequence, however, the I/O performance of a platform cannot be evaluated through JetBench. An application may also specify a target time period within which it must finish execution, and excessive computation could have caused the benchmark to perform poorly on the majority of low-end systems. JetBench avoids this problem by covering only a limited number of typical thermodynamic calculations used in jet engines. As a consequence of this restricted workload, the computations may be small enough that a high-end multicore system's actual performance is not reported as well as that of a low-end multicore platform. The JetBench application gives the user an overview of the real-time performance of the system and can also be used to discover the optimum number of threads for a desired level of performance. The benchmark is mainly comprised of ALU-centric operations such as integer/double multiplication, addition, and division for the computation of exponents, square roots, the value of pi, and degree-to-radian conversions. All these operations are based on real thermodynamic equations and operations required by a jet engine control unit. The benchmark is composed of an 88.6% parallel portion, as reported by thread analysis tools. (vi) MultiBench 1.0 [64] is EEMBC's benchmark suite for measuring the throughput of multiprocessor systems and multicore processors. MultiBench seems to be moving away from published scores toward private testing: it is used primarily as an analysis tool for internal testing and closed-door sales pitches. The basic requirement of industry is a technically valid method for evaluating the performance of multicore processors, and within its limitations MultiBench meets that need.
Figure 5.6 : MultiBench 1.0 workloads and work items
Many tasks in MultiBench are adapted from existing EEMBC suites, along with some new ones. Multicore embedded processors typically perform the same application-level tasks as single-core embedded processors, so MultiBench reuses existing EEMBC suite code, saving valuable development time. Tasks in the early EEMBC suites are referred to as kernels; a kernel is an algorithm or routine that performs a common task found in real-world embedded software. In MultiBench, however, tasks are called workloads. A workload may include one or more work items. Work items are similar to the kernels of old in that they mate algorithms with the sample datasets on which the algorithms operate. The MultiBench workload 64M-cmykw2 is one example: it consists of a single work item, a color-conversion routine that transforms four 12-megapixel images from the RGB color space to the CMYK color space. 64M-rotatew2 is another single-item workload, an image-rotation routine that operates on four 12-megapixel images, turning each one 90 degrees clockwise. Another popular workload, 64M-cmykw2-rotatew2, combines the color-conversion and rotation workloads into a two-item workload. In all, MultiBench 1.0 has 36 workloads, some of which are combinations of work items from other workloads. MultiBench users can also create custom workloads by selectively choosing work items and changing their parameters; work items can run at configurable threading levels and can operate on different datasets, including datasets the user supplies. Custom workloads aren't valid for MultiBench scoring, but they allow testers to create a virtually infinite variety of workloads, even when using the standard EEMBC datasets.
Testers can exercise processors and systems to identify strengths and weaknesses, or to compare the performance of multicore processors with that of single-core processors. Users can also test the effectiveness of multithreading at different levels: time-sliced on a single processor or distributed among multiple processors.
6. CONCLUSION AND FUTURE CHALLENGES 6.1. Conclusion A significant performance advantage as well as improved power consumption has been observed in multicores in recent years. The relationship between clock rate and power consumption, coupled with a limited ability to handle dissipated heat, means that performance improvements now come in the form of a growing number of parallel cores instead of an increased clock rate. Moore's Law has been used to predict that chips with on the order of 100 cores will appear within a decade. Such a large number of cores per chip introduces significant architectural challenges, one of the foremost being power consumption. An average commercially available core today consumes tens of watts; even at 10 watts per core, a 1000-core chip by itself would need 10,000 watts. Careful design is therefore required, especially when increasing the resources available to a chip, so that the performance gain exceeds the increase in core area. Research over recent years has proposed numerous approaches for programming the diverse multicore hardware platforms. One approach for handling homogeneous multicore nodes is to apply current distributed-memory programming models to the individual cores. There also exist various shared-memory programming models: Pthreads and OpenMP are two of the most popular, and a more recent effort is Intel's Threading Building Blocks (TBB). For heterogeneous multicore nodes there exists a variety of architectures as well as many programming models. One major limitation of all these approaches is that they are vendor-dependent. Recently, however, there have been attempts to define programming environments usable across a variety of diverse architectures, such as multicore CPUs, GPUs, and the STI Cell.
For example, there is a proprietary Multicore Development Platform which provides a single C++ programming interface capable of exploiting multiple back-ends. OpenCL is another effort that has defined an open standard for programming multicore processors; with backing from numerous hardware and software vendors, this framework consists of a new language (based on C) for writing parallel kernels.
Many embedded system designers are still struggling to determine whether multicore really buys them anything in terms of performance, so there must be a method for analyzing multicore performance that provides reliable information. Measuring multicore performance requires new ways of benchmarking, and once such platforms are devised, there must be new methods for interpreting the results. Further, for a benchmark to be relevant for multiple cores and produce comparable results, it must execute the same amount of work regardless of the number of contexts used, and it should be able to show the performance improvement that results from the number of contexts used. There are numerous benchmarks available in the market today. Many traditional benchmarks developed for multiprocessors are equally applicable to multicores, and there are also new benchmarks targeted at the multicore community, such as MultiBench and ParMiBench. During this survey project we have gone through many areas associated with single-chip multicore systems and gathered a significant amount of knowledge about the evolution of multicores and the current and future research in this area. A decade ago, multicore systems were used only by researchers for scientific and compute-intensive work, but nowadays even simple desktop machines contain multicore processors. Processor performance generally followed Moore's Law; until about 2002 this growth was sustained through instruction-level parallelism, VLIW, and pipelining, but after that even these techniques could not deliver the growth Moore's Law suggests. So, finally, designers combined two or more processors to develop multicore processors. Multicore processors give a considerable performance increase over earlier systems through multitasking and thread-level parallelism. Although multicore architectures bring a performance boost, they also bring challenges.
Among these challenges, the software challenge has the most significant impact. Until the advent of these processors, developers were accustomed to writing sequential code, but sequential code cannot deliver the expected performance when run on multicores. A parallel programming paradigm was needed to exploit the capability of multicore processors. Many parallel programming languages have been developed for programming multicore platforms, but the software community and application developers are still not accustomed to them. It is also quite difficult to write and debug a multithreaded program on a parallel language platform. This makes parallel programming less attractive to the software community, but it is expected that these
challenges will be resolved soon because multicore is a very active research area: a huge amount of research is going on around the world to overcome the current challenges so that multicores can be widely accepted in the future. Dual-core and quad-core processors are common now, and in the future we expect tri-, hexa-, and octo-core systems, up to systems with tens to thousands of cores that can efficiently execute multithreaded parallel programs. 6.2. Future Challenges 6.2.1 Software Challenges The multicore processor is very common these days; almost every new laptop or desktop contains one or more of them. The main reason for the shift from single-core to multicore devices is that hardware designs encountered the technical and physical limits of semiconductor design. Multicore processors promise a huge performance improvement over single processors due to their ability to work in parallel, but this improvement brought a new challenge with it, because developers now have to handle concurrency. It was realized long ago that parallel hardware alone is not sufficient to utilize the parallel ability of a multicore processor; a parallel programming paradigm is needed alongside the hardware to exploit the parallelization capability of the multicore chip. The way we achieve parallelism with today's operating systems is to have multiple processes or threads capable of running concurrently. In these operating systems the programmer either handles task (thread) scheduling directly or controls the scheduling through operating system calls and objects such as mutexes, events, and locks. This places the responsibility for thread or task synchronization on the programmer, adds to the difficulty of the programming effort, and leads to many problems. A few of these problems are [39]: Reentrancy: All code shared between more than one task must be reentrant, meaning it can be called again before its previous execution is over.
Deadlocks: Deadlocks happen when two or more tasks become blocked because each is waiting on a resource that another holds.
Livelocks: Livelocks are similar to deadlocks, with the subtle difference that the tasks are not blocked; they nevertheless cannot continue to do useful work, because each requires a resource held by another task. Race conditions: A race condition occurs when two or more tasks access the same block of code at the same time. Under these circumstances, the result of a computation depends on which task executing the code snippet reaches a certain point first. The probability of this happening increases with the number of cores. Synchronization: High-performance parallel applications running on multicore processors require very efficient synchronization mechanisms. Traditional synchronization techniques such as spin locks use busy-waiting to enforce mutual exclusion. Many new technologies have evolved to provide improved synchronization mechanisms; one of them, described in [40], is the Synchronization Handler (SH), which is specifically targeted at multiprocessor architectures. Fig. 6.1 shows the architecture of the SH in a 16-node 2D mesh system interconnected via a packet-switched network. Each node contains a processor, a local memory, a Network Interface (NI), and a Synchronization Handler (SH). The SH consists of a set of synchronization variables and two physical buffers that provide efficient synchronization support for mutual exclusion. The two physical buffers can handle two simultaneous requests, one from the local processor and one from remote processors. Each buffer owns a queue to store incoming requests and respond to the appropriate one. The physical buffers are organized as multiple virtual buffers, each associated with a lock; since each virtual buffer has its own lock, requests on different locks can proceed independently without interfering with each other. Requests are of two types: Lock Acquire and Lock Release.
A Lock Release request always goes along the bypass path, while a Lock Acquire request can go along the bypass path only when the related virtual buffer is empty. That is to say, the virtual buffers are used only for Lock Acquire requests.
Fig 6.1 Architecture of the Synchronization Handler Fig 6.2 (left) Dynamic buffer allocation and virtual buffer organization, and (right) an example of how the two physical buffers are dynamically organized as multiple virtual buffers Fig. 6.2 (left) shows synchronization requests from the local processor and the network passing through the corresponding virtual buffers and finally entering the Synchronization Variable Pool. Fig. 6.2 (right) shows an example scenario in which requests #1, #5, and #6 acquire Lock 1, requests #3 and #7 acquire Lock 2, and requests #2 and #4 acquire Lock N. The requests are reordered according to the locks they acquire: requests 1, 5, and 6 logically form Virtual Buffer 1; requests 3 and 7 logically form Virtual Buffer 2; and requests 2 and 4 logically form Virtual Buffer N. Experiments evaluating buffer utilization show satisfactory results: the technique improves buffer utilization to a great extent. Load Balancing: The load-balancing control unit in the processor assigns an available core to each task. Whenever more than one core is available to execute a task, we have to decide which one to choose, and this decision greatly affects the power and thermal behavior of the die. Faulty load-balancing techniques can increase power consumption a great deal and can also generate
hotspots. Therefore there is high demand for a simple yet effective load-balancing mechanism for assigning tasks to cores. The traditional methods are Round Robin (RR) and Lowest Index First (LIF). In RR the load is evenly distributed across all cores; in LIF the load is concentrated in the cores with smaller indices. The uneven distribution of the latter approach can generate hot spots. Hot spots are undesirable because they reduce the reliability of the processor: the cores in the hot-spot area suffer more wear, which shortens the processor's lifetime. The LIF technique thus has the drawback of generating a hot spot in the core-pool region, which RR avoids. However, RR exhibits a significant penalty due to the latency of waking cores up: because RR puts a core to sleep right after it finishes a task, it may happen that when the core must execute the next task, the time it has spent in the off state is too short to recoup the power cost of waking up, resulting in high power consumption. A new algorithm, Waiting Idle First (WIF), is proposed in [41] to mitigate the problems of RR and LIF; it gives the twofold benefit of reducing the probability of hot-spot generation, as in RR, while incurring a low performance penalty, as in LIF. The main idea of WIF is that whenever a core finishes executing a task, its index is recorded and used as the starting point the next time an available core must be found. The implementation is very similar to RR and LIF, and does not add significant area to the load-balancing unit.
The goal of WIF is twofold: (a) try to use the most recently finished core so that, hopefully, it is still idle and does not have to be woken up, and (b) follow a round-robin policy to evenly distribute the power density when the most recently finished core is not available. 6.2.2 Programmer's Challenges To gain full performance from the multicore architecture, the programming paradigm also has to move from sequential to parallel. It is not easy for a programmer to take advantage of multiple processor cores. Effective parallelization requires the developer to determine the potential parallelism in an application and then package that parallelism into multiple threads of execution such that it is fully exploited. This process is highly architecture-dependent: a parallel program optimized for one particular system will not be efficient on more than a few highly similar systems. Unfortunately, multicore systems are more diverse
and have a more widely spread design spectrum, which makes the development of cross-platform multicore applications even more difficult. Multicore diversity can be categorized into architectural diversity and environmental diversity [43]. Multicore architectures vary widely in the number of cores, core complexity, cache hierarchy, memory bandwidth, and internal heterogeneity of the cores. Architectural and environmental diversity make it impossible to determine the strategy for solving an application in advance; overcoming this diversity will be a nightmare for developers who intend to deploy efficient parallel programs on multicore platforms. 6.2.3 Hardware Challenges I/O Coherency: Maintaining coherency among the cores, the caches, and the data generated or consumed by I/O devices is challenging. The I/O subsystem is a critical part of any system, single-core or multiprocessor. Multicore systems that consume, process, and produce bulk amounts of data are very common, so the communication mechanism used to pass data between I/O peripheral devices and on-chip cores is gaining more and more importance. Most mechanisms for passing data between I/O devices and a CPU implement the classic producer-consumer model for coherent communication. Though this model is straightforward and easy to implement, maintaining coherency with it becomes difficult as system complexity increases. Multicore systems introduce multiple levels of write-back caches, hierarchical system buses, and multiple cores, which create more places where data can be temporarily stored, making it more difficult to ensure data consistency and coherency [45]. I/O data coherence refers to maintaining the coherency and consistency of data passed between an I/O peripheral device and one or more CPU cores. The data can flow in either direction: an I/O device produces data and the CPUs consume it, or vice versa.
Generally, coherency is maintained either by software or by hardware.
Fig 6.3 Software coherency architecture In software I/O coherent systems, all the L1 caches are connected to the coherence manager and in turn to an optional L2; these connect to the I/O devices through the memory interconnect. In step 1, the CPU writes the data into memory, the L1 cache, or the L2 cache. In step 2, the CPU makes the data visible to the I/O devices by forcing the data into main memory. In step 3, the CPU sets a flag by writing a location in main memory or a register on the I/O device. In step 4, the I/O device recognizes that the flag is set, and finally, in step 5, the I/O device reads the data from main memory. Thus, although the coherence manager maintains coherency between the CPU cores and the L1 caches, it is software's responsibility to maintain coherence between I/O data and the L1 caches [45]. Hardware I/O coherence reduces software's burden of maintaining coherency; moreover, in some hardware I/O coherent systems, data can pass to the I/O devices through the L2 cache, which increases performance. Fig. 6.4 Hardware coherency architecture
In hardware coherent systems, reads and writes from an I/O device occur coherently: they are routed through the I/O interconnect to the coherence manager and into the optional L2 cache or to main memory. Keeping I/O transactions coherent with the L1 caches is not necessarily the coherence manager's duty: without L1 cache coherency, the coherence manager redirects I/O transactions directly to the L2 cache, and software must manage L1 cache coherence. However, because both the CPUs and the I/O devices access the L2 cache, coherence is implicitly maintained for data stored there. Hardware I/O coherence that includes the L1 caches removes the software overhead of explicitly flushing them. Cache Coherency: Most recent multicore architectures use multilevel caches, in which each core has a private cache for faster access and better performance. This benefit brings another challenge: keeping all these private caches coherent, so that whenever one core changes a block in its private cache, all other cores update their copies of that block (if they hold one). Currently, snooping and directory-based protocols are the most popular methods for maintaining cache coherency. We have reviewed another hardware-based cache-coherency scheme, given in [46], called the Write Broadcast Protocol or OMPCC (Oakland Multiple Processor Cache Coherency). What distinguishes this scheme is its use of two different controllers: a private cache controller and a snooping cache controller. The private cache controller together with the private caches forms the private cache system, and the snoop cache memory together with the snoop controllers forms the snoop cache system. These subsystems are connected to each other by separate buses, such as the coherency bus and the processor bus. The protocol has four states: Single-Unchanged, Single-Changed, Multiple-Unchanged, and Multiple-Changed.
Single-Unchanged: This is the only copy of the memory block in any cache, and it is identical to the block in memory. Single-Changed: This is the only copy of the memory block in any cache, but it differs from the corresponding block in memory. Multiple-Unchanged: Several copies identical to the memory block exist in multiple caches. Multiple-Changed: Several copies exist in multiple caches; they are identical to one another but differ from memory. The main advantage of this scheme lies in its use of two-level caches and coherence mechanisms, which reduces the bottleneck created by a single cache having to satisfy both cache requests and snooping requests. The use of a separate coherency bus also spares the main memory bus from carrying the time-consuming coherency
operations. Like other coherency protocols, it has four operations: Read Hit, Read Miss, Write Hit, and Write Miss, and the state of a cache block changes depending on which operation is performed. For example, when a Read Miss occurs, the snoop controller places a broadcast request on the coherency bus, and the block is written into both the PCC and the SCC of the requesting processor with the state Multiple-Unchanged or Multiple-Changed. Interconnect: The interconnect is an integral part of any multicore system architecture, as it connects all the components on the chip. It is very important to have a deep understanding of the interconnect framework architecture and how the interconnect works with the cores. Interconnects among cores on the same chip are harder to understand than those in multichip modules, because a huge number of factors such as power, area, latency, and bandwidth enter into the design of the former. Furthermore, the design of the interconnect depends to a great extent on the design of the cores and caches. A hierarchical interconnect model is proposed in [26]. This architectural model consists of a Shared Bus Fabric (SBF) that connects the modules, with point-to-point connections linking pairs of SBFs. Each SBF in turn comprises an address bus, a snoop bus, a data bus, and a response bus. A typical communication goes through the following steps. Fig 6.5 Architecture of the Shared Bus Fabric Each core here is considered to have its own L2 cache, so only loads that miss in the L2 cache enter the shared bus fabric to be serviced by other cores. First, the requester (in this case, one of the cores) signals the central address arbiter that it wants to transmit a request. It then sends the request over
an address bus (AB in Figure 6.5) once its grant is received. Requests are extracted from the end of the address bus and put in a snoop queue, where they wait for access to the snoop bus (SB). A transaction placed on the snoop bus causes each snooping node to place a response on the response bus (RB). Logic and queues at the end of the response bus gather these responses from the other nodes and generate a broadcast message, sent back over the response bus, identifying the action each involved component should take (e.g., source the data, change coherence state). Finally, the data is sent over a bidirectional data bus (DB) to the original requester. If there are multiple SBFs (e.g., connected by a P2P link), the address request is broadcast to the other SBFs via the P2P link, and a combined response from the remote SBF is returned to the local one, to be merged with the local responses. Another sensitive issue related to the interconnect is temperature control. As the number of cores in multiprocessor systems increases, the amount of heat generated also increases. This increase in on-chip heat can cause longer interconnect delays, which in turn degrade chip performance [47]. The power consumed by on-chip interconnection networks is converted into heat, which affects both the underlying silicon and the metal layers, can be very harmful to the chip, and reduces its reliability. So there is a great need to constantly monitor the amount of heat generated on a chip.
7. REFERENCES [1] Victor Pankratius, Michael Philippsen, New Horizons in Multicore Software Engineering, Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, 2010 [2] Survey of Formal Models of Computation for Multi-CoreSystems,Valentina Zadrija, Technical Report 03/31/2009 [3] Gajski, D. D., Gerstlauer, A., Abdi, S., Schirner, G.,Embedded System Design, Modeling, Synthesis,Verification. Kluwer Academic Publishers, 2009 [4] FreeScale Semiconductor, Embedded Multicore: An Introduction, 2009 [5] Geoffrey Blake, Ronald G. Dreslinski, and Trevor Mudge, A Survey of Multicore Processors, IEEE SIGNALPROCESSING, 2009 [6] Advanced Micro Devices Inc. Key architectural features AMD Phenom II processors, AMD Product Information, 2008 [Online]. Available: http://www. amd.com/usen/processors/productinformation/0,,30_118_15331_15917%5e15919,00.html [7] Advanced Micro Devices Inc., Software optimization guide for AMD family 10h processors, AMD White Papers and Technical Documents, Nov. 2008 [Online]. Available: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf [8] Intel Corp., Intel core i7-940 processor, Intel Product Information, 2009 [Online]. Available: http://ark.intel.com/cpu.aspx?groupid=37148 [9] Intel 64 and IA-32 Architectures Software Developer s Manual, Intel Developer Manuals, vol. 3A, Nov. 2008 [10] Sun Microsystems Inc., UltraSPARC T2 processor, Sun Microsystems Data Sheets, 2007 [Online]. Available: http://www.sun.com/processors/ultrasparct2/datasheet.pdf [11] T. Johnson and U. Nawathe, An 8-core, 64-thread, 64-bit power efficient sparc soc (niagara2), in Proc. 2007 Int. Symp. Physical Design ISPD 07. New York, NY: ACM, 2007, pp. 2 2. [12] Intel Corp., Intel atom processor for nett op platforms, Intel Product Brief, 2008 [Online]. Available: http://download.intel.com/products/atom/319995.pdf [13] ARM Ltd., The ARM Cortex-A9 Processors, ARM Ltd. White Paper, Sept. 2007 [Online]. 
Available: http://www.arm.com/pdfs/armcortexa-9processors.pdf [14] D. May, XMOS XS1 architecture, XMOS Ltd., July 2008 [Online]. Available: http://www.xmos.com/files/xs1-87.pdf [15] Advanced Micro Devices Inc., ATI Radeon HD 4850 & ATI Radeon HD 4870 GPU specifications, AMD Product Information, 2008 [Online]. Available: http://ati.amd.com/products/radeonhd4800/specs3.html [16] NVIDIA Corp., NVIDIA CUDA: Compute unified device architecture, NVidia CUDA Documentation, June 2008 [Online]. Available: http://developer. download.nvidia.com/compute/cuda/2_0/docs/nvidia_cuda_programming_guide_2.0.pdf [17] Intel QuickPath Interconnect, http://www.intel.com/technology/quickpath/introduction.pdf [18] R. Merritt, X86 Cuts to the Cores, EETimes Online, September 2007, http://www.eetimes.com/showarticle.jtml?articleid=202100022 [19] Alexandra Fedorova, Sergey Blagodurov, and Sergey Zhuravlev. 2010. Managing contention for shared resources on multicore processors. Commun. ACM 53, 2 February 2010 [20] S. Cho and L. Jin. Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 455 468, 2006 [21] Harold S. Stone, John Turek, Joel L. Wolf, Optimal Partitioning of Cache Memory, Journal IEEE Transactions on Computers Volume 41 Issue 9, September 1992 [22] Gabriel H. Loh, Extending the Effectiveness of 3D-Stacked DRAM Caches with an Adaptive Multi-Queue Policy, Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009 [23] A. Jaleel, W. Hasenplaugh, M. Qureshi, J. Sebot, S. S. Jr., and J. Emer. Adaptive Insertion Policies for Managing Shared Caches. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, Brasov, Romania, September 2007. 70
[24] A.C. Sodan, Jacob Machina, Arash Deshmeh, Kevin Macnaughton, Bryan Esbaugh Parallelism via Multithreaded and Multicore CPUs IEEE Digital Object Indentifier 10.1109/MC.2009.377 [25] IBM Blue Gene team, Overview of the IBM Blue Gene/P project, IBM Journal of Research and Development, 52(1-2), January-March 2008, pp. 199-220. [26] R. Kumar, V. Zyuban, and D.M. Tullsen, Interconnections in Multicore Architectures: Understanding Mechanisms, Overheads, and Scaling,Proc Int l Symp. on Computer Architecture (ISCA05), 2005.Proc. Int l Symp. on Computer Architecture (ISCA05), 2005. [27] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely Jr., and J. Emer. Adaptive insertion policies for high-performance caching. In ISCA-34,2007. [28] M.D. Hill and M.R. Marty, Amdahl s Law in the Multicore Era, IEEE Computer, July 2008, pp. 33-38. [29] Multicore Programming with LabVIEW http://zone.ni.com/devzone/cda/tut/p/id/6099 [30] Programming Strategies for Multicore Processing: Task Parallelism http://zone.ni.com/devzone/cda/tut/p/id/6420 [31] Programming Strategies for Multicore Processing: Data Parallelism http://zone.ni.com/devzone/cda/tut/p/id/6421 [32] Programming Strategies for Multicore Processing: PipeLining http://zone.ni.com/devzone/cda/tut/p/id/6425 [33] Dominic A. Orchard,Max Bolingbroke,Alan MycroftYpnos: Declarative, Parallel Structured Grid Programming,DAMP 10, January 19, 2010, Madrid, Spain. 
[34] Hahn Kim and Robert Bond, Multicore Software Technologies, IEEE Signal Processing Magazine, November 2009
[35] Cilk 5.4.6 Reference Manual, Supercomputing Technologies Group, MIT Laboratory for Computer Science, http://supertech.lcs.mit.edu/cilk
[36] NVIDIA
[37] CUDA Programmer's Guide, http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf
[38] Intel Threading Building Blocks (Intel TBB) 2.2 In-Depth
[39] Ronald Goodman and Scott Black, Design Challenges for Realization of the Advantages of Embedded Multi-Core Processors, IEEE AUTOTESTCON 200, Salt Lake City, UT
[40] Xiaowen Chen, Zhonghai Lu, Axel Jantsch, and Shuming Chen, Supporting Efficient Synchronization in Multi-core NoCs Using Dynamic Buffer Allocation Technique, in ISVLSI '10: Proceedings of the 2010 IEEE Annual Symposium on VLSI
[41] Enric Musoll, A Thermal-Friendly Load-Balancing Technique for Multi-Core Processors, in ISQED '08: Proceedings of the 9th International Symposium on Quality Electronic Design
[42] Rajagopal Nagarajan, Multicore Technologies and Software Challenges, available online at http://www.eetimes.com/design/embedded/4008860/multicore-technologies-andsoftware-challenges?pagenumber=1
[43] David A. Penry, Multicore Diversity: A Software Developer's Nightmare, Operating Systems Review, 43(2), pp. 100-101, 2009
[44] Liang Peng, Mingdong Feng, and Chung-Kwong Yuen, Evaluation of the Performance of Multithreaded Cilk Runtime System on SMP Clusters, in Cluster Computing 1999: Proceedings of the IEEE Computer Society International Workshop
[45] T. B. Berg, Maintaining I/O Data Coherence in Embedded Multicore Systems, IEEE Micro, 29(3), May-June 2009
[46] P. V. Raja and S. Ganesan, A Hardware Cache Coherency Scheme for Multiprocessors, in Proceedings of the 36th Midwest Symposium on Circuits and Systems, IEEE, 1993
[47] K. R. Vaddina, P. Liljeberg, and J. Plosila, Thermal Analysis of On-Chip Interconnects in Multicore Systems, IEEE NORCHIP, 16-17 Nov. 2009
[48] G. E. Suh et al., Dynamic Partitioning of Shared Cache Memory, Journal of Supercomputing, 28(1), 2004
[49] Moinuddin K. Qureshi and Yale N. Patt, Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches, in MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006
[50] Nan Guan, Martin Stigge, Wang Yi, and Ge Yu, Cache-Aware Scheduling and Analysis for Multicores, in EMSOFT '09: Proceedings of the Seventh ACM International Conference on Embedded Software, 2009
[51] John M. Calandrino and James H. Anderson, On the Design and Implementation of a Cache-Aware Multicore Real-Time Scheduler, in ECRTS '09: Proceedings of the 2009 21st Euromicro Conference on Real-Time Systems, 2009
[52] Allan Snavely and Dean Tullsen, Symbiotic Jobscheduling for a Simultaneous Multithreading Processor, in Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX), November 2000
[53] Inchoon Yeo, Chih Chun Liu, and Eun Jung Kim, Predictive Dynamic Thermal Management for Multicore Systems, in DAC '08: Proceedings of the 45th Annual Design Automation Conference, 2008
[54] Anita Lungu, Pradip Bose, Daniel J. Sorin, Steven German, and Geert Janssen, Multicore Power Management: Ensuring Robustness via Early-Stage Formal Verification, in MEMOCODE '09: IEEE Conference on Formal Methods and Models for Co-Design, 2009
[55] Xuan Qi and Dakai Zhu, Power Management for Real-Time Embedded Systems on Block-Partitioned Multicore Platforms, in ICESS 2008: IEEE International Conference on Embedded Software and Systems, 2008
[56] S. Gal-On and M. Levy, "Measuring Multicore Performance," Computer, vol. 41, no. 11, pp. 99-102, Nov. 2008
[57] Shih-Hao Hung, Chia-Heng Tu, and Thean-Siew Soon, "Trace-Based Performance Analysis Framework for Heterogeneous Multicore Systems," in 2010 15th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 19-24, 18-21 Jan. 2010
[58] J. Treibig, G. Hager, and G. Wellein, "LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments," in 2010 39th International Conference on Parallel Processing Workshops (ICPPW), pp. 207-216, 13-16 Sept. 2010
[59] D. P. Barker, "Realities of Multi-Core CPU Chips and Memory Contention," in 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, pp. 446-453, 18-20 Feb. 2009
[60] J. Gonzalez-Dominguez, G. L. Taboada, B. B. Fraguela, M. J. Martin, and J. Touriño, "Servet: A Benchmark Suite for Autotuning on Multicore Clusters," in 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pp. 1-9, 19-23 April 2010
[61] Nguyen Due Chinh, E. Kandasamy, and Lam Yoke Khei, "Efficient Development Methodology for Multithreaded Network Application," in SCOReD 2007: 5th Student Conference on Research and Development, pp. 1-5, 11-12 Dec. 2007
[62] Syed Muhammad Zeeshan Iqbal, Yuchen Liang, and Hakan Grahn, "ParMiBench - An Open-Source Benchmark for Embedded Multiprocessor Systems," IEEE Computer Architecture Letters, vol. 9, no. 2, pp. 45-48, Feb. 2010
[63] Christian Müller-Schloer, Wolfgang Karl, Sami Yehia, Muhammad Qadri, and Dorian Matichard, JetBench: An Open Source Real-Time Multiprocessor Benchmark, in Architecture of Computing Systems - ARCS 2010, LNCS 5974, pp. 211-221, 2010
[64] Tom R. Halfhill, EEMBC's MultiBench Arrives, EEMBC Press Release, 2008
[65] C. Pitter and M. Schoeberl, "Performance Evaluation of a Java Chip-Multiprocessor," in SIES 2008: International Symposium on Industrial Embedded Systems, pp. 34-42, 11-13 June 2008
[66] S.-H. Hung, S.-J. Huang, and C.-H. Tu, "New Tracing and Performance Analysis Techniques for Embedded Applications," in Proceedings of the 14th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, 2008, pp. 143-152
[67] Erlin Yao, Yungang Bao, Guangming Tan, and Mingyu Chen, Extending Multicores in Multicore Era, ACM SIGMETRICS Performance Evaluation Review, 37(2), September 2009
[68] David A. Bader, Varun Kanade, and Kamesh Madduri, SWARM: A Parallel Programming Framework for Multicore Processors, in IPDPS 2007: IEEE International Parallel and Distributed Processing Symposium, 2007
[69] Frank Otto, Victor Pankratius, and Walter F. Tichy, High-Level Multicore Programming with XJava, in 31st International Conference on Software Engineering
[70] M. R. Guthaus et al., MiBench: A Free, Commercially Representative Embedded Benchmark Suite, in Proc. of the IEEE Int'l Workshop on Workload Characterization (WWC-4), Dec. 2001
[71] Y. Liang and S. M. Z. Iqbal, OpenMPBench - An Open-Source Benchmark for Multiprocessor Based Embedded Systems, Master thesis report MCS-2010:02, School of Computing, Blekinge Institute of Technology, Sweden, Jan. 2010
[72] L. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese, Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing, in ISCA-27, 2000
[73] L. Hammond, B. A. Nayfeh, and K. Olukotun, A Single-Chip Multiprocessor, IEEE Computer, 30(9), 1997
[74] IBM, Power4: http://www.research.ibm.com/power4
[75] IBM, Power5: Presentation at Microprocessor Forum, 2003
[76] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Oberg, M. Millberg, and D. Lindqvist, Network on Chip: An Architecture for Billion Transistor Era, in IEEE NorChip Conference, Nov. 2000
[77] W. J. Dally and B. Towles, Route Packets, Not Wires: On-Chip Interconnection Networks, in DAC-38, pp. 684-689, 2001