The Benefit of SMT in the Multi-Core Era: Flexibility towards Degrees of Thread-Level Parallelism

Stijn Eyerman    Lieven Eeckhout
Ghent University, Belgium
Stijn.Eyerman@elis.UGent.be, Lieven.Eeckhout@elis.UGent.be

Abstract

The number of active threads in a multi-core processor varies over time and is often much smaller than the number of supported hardware threads. This requires multi-core chip designs to balance core count and per-core performance. Low active thread counts benefit from a few big, high-performance cores, while high active thread counts benefit more from a sea of small, energy-efficient cores. This paper comprehensively studies the trade-offs in multi-core design given dynamically varying active thread counts. We find that, under these workload conditions, a homogeneous multi-core processor, consisting of a few high-performance SMT cores, typically outperforms heterogeneous multi-cores consisting of a mix of big and small cores (without SMT), within the same power budget. We also show that a homogeneous multi-core performs almost as well as a heterogeneous multi-core that also implements SMT, as well as a dynamic multi-core, while being less complex to design and verify. Further, heterogeneous multi-cores that power-gate idle cores yield (only) slightly better energy-efficiency compared to homogeneous multi-cores. The overall conclusion is that the benefit of SMT in the multi-core era is to provide flexibility with respect to the available thread-level parallelism. Consequently, homogeneous multi-cores with big SMT cores are competitive high-performance, energy-efficient design points for workloads with dynamically varying active thread counts.
Categories and Subject Descriptors C.1.4 [Processor Architectures]: Parallel Architectures

Keywords Chip Multi-Core Processor; SMT; Single-ISA Heterogeneous Multi-Core; Thread-Level Parallelism

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ASPLOS '14, March 1–5, 2014, Salt Lake City, Utah, USA. Copyright © 2014 ACM.

1. Introduction

The number of active threads in a processor varies over time, and is often (much) smaller than the number of available hardware thread contexts. This observation has been made across different application domains. Desktop applications exhibit a limited amount of thread-level parallelism, with typically only two to three active threads []. Datacenter servers are often underutilized and seldom operate near their maximum utilization; they operate most of the time between 10 and 50 percent of their maximum utilization level []. Even parallel, multi-threaded applications do not utilize all cores all the time. Threads may be waiting because of synchronization primitives (locks, barriers, etc.) and may yield the processor to avoid active spinning [6]. Finally, in a multi-programmed environment, jobs come and go, and hence, the amount of available thread-level parallelism varies over time. Workloads with dynamically varying active thread counts imply that multi-core chip designs should balance core count and per-core performance.
Few high-performance cores are beneficial at low active thread counts, while a sea of energy-efficient cores is preferred at high active thread counts. The key question is what processor architecture is best able to deal with dynamically varying degrees of thread-level parallelism. A heterogeneous single-ISA multi-core with a few big cores and many small cores [9] might schedule threads onto the big cores in case there are few active threads, and only schedule threads on the small cores when the number of active threads exceeds the number of big cores. A conventional homogeneous multi-core with Simultaneous Multi-Threading (SMT) cores [] might schedule threads across the various cores if there are fewer active threads than cores. Each thread would then have an entire core at its disposal, and only when the number of active threads exceeds total core count would one engage SMT to improve chip throughput. Ideally, core count and size should be dynamically changed depending on the number of active threads, and people have proposed to fuse small cores into bigger cores as a function of the number of active threads [, 7]. Determining the appropriate processor architecture is not only important in the context of delivering high performance under
various workload conditions; it also involves other design concerns such as power/energy as well as the cost to design and verify the chip, i.e., a heterogeneous or core-fusion processor architecture is likely more costly to design and verify than a homogeneous multi-core.

This paper studies major multi-core design trade-offs in the face of dynamically varying degrees of available thread-level parallelism. Through a set of comprehensive experiments, we find that a homogeneous multi-core with big SMT cores outperforms heterogeneous designs, under the same power envelope, when there is a varying degree of thread-level parallelism, for both multi-program and multi-threaded workloads. The intuition is that when there are few active threads, they can be scheduled across the available big cores with a few, or even a single, SMT hardware thread context active, and hence achieve good single-thread performance. We also find that a homogeneous multi-core with SMT performs almost as well as a heterogeneous design that also exploits SMT, and that its performance is also close to that of a dynamic multi-core design, in which the configuration (number of big and small cores) can change dynamically depending on the number of active threads.

The result that the performance of a homogeneous multi-core with big SMT cores is comparable to a heterogeneous multi-core design is, we believe, counter-intuitive. It is well-known, and confirmed by our experimental results, that a number of small cores achieve better aggregate performance (throughput) than a high-performance SMT core under the same power budget. Hence, it is to be expected that overall performance will be higher for a homogeneous multi-core with many small cores, as well as for a heterogeneous multi-core with a few big cores and many small cores, when there are many active threads in the system.
However, under variable active thread workload conditions, a homogeneous design with big SMT cores is a competitive design point because it can more easily adapt to software diversity, and deliver both the best possible chip throughput when there are few active threads, and comparable performance when there are many active threads. While we show that a homogeneous multi-core consisting of all big cores with SMT is competitive with a heterogeneous multi-core in terms of performance, the latter has more opportunities to save power by power-gating idle cores. Cores can only be switched off when there are fewer active threads than cores, resulting in fewer power-gating opportunities for configurations with fewer cores. We find, however, that a heterogeneous multi-core has only slightly better energy-efficiency compared to a homogeneous all-big-core configuration under variable thread-level parallelism.

The overall conclusion from this paper is that, although SMT was designed to improve single-core throughput [], the real benefit of SMT in the multi-core era is to provide flexibility with respect to the available thread-level parallelism. Consequently, we find that a homogeneous multi-core with big SMT cores is a competitive high-performance, energy- and cost-efficient design point when the active thread count varies dynamically in the workload.

2. Motivation

2.1 Varying thread-level parallelism

We identify at least four application domains that exhibit varying degrees of available thread-level parallelism during runtime.

Multi-programmed workloads. The most obvious reason for having a varying number of active threads is multi-programming. Jobs come and go, and hence the amount of thread-level parallelism varies over time. Jobs are also scheduled out when performing I/O (disk and network activity).

Desktop applications. A recent study by Blake et al. [] quantifies the amount of thread-level parallelism in contemporary desktop applications.
They find the amount of thread-level parallelism to be small, with typically only two to three active threads on average, even after ten years of multi-core processing.

Server workloads. Servers in datacenters operate between 10 and 50 percent of their maximum utilization level most of the time, according to Barroso and Hölzle []. They found the distribution of utilization at a typical server within Google to have a peak around zero utilization and one at moderate utilization. A multi-core server that is underutilized implies that there are only a few active threads.

Multi-threaded applications. Even multi-threaded applications may not have as many active threads as there are software threads at all times during the execution. Threads may be waiting because of synchronization due to locks, barriers, etc., and may yield to the operating system to avoid active spinning. Figure 1 quantifies the number of active threads when running the PARSEC benchmarks [] on a twenty-core processor. (We refer to Section 3 for details on the experimental setup.) Some benchmarks have 20 active threads most of the time (blackscholes, canneal and raytrace), whereas others have 20 active threads only a small fraction of the time (e.g., ferret, freqmine and swaptions). Some benchmarks have either one or twenty active threads (e.g., bodytrack and swaptions); others have a larger variation in the number of active threads (e.g., dedup, ferret and freqmine). On average across all PARSEC benchmarks running on 20 cores, we find that all 20 threads are active only half of the time, and a substantial fraction of the time only a few threads are active. Note that these numbers are generated for the parallel part of the application (the so-called region of interest (ROI), as it is defined for the PARSEC benchmarks), so the limited number of active threads only stems from inter-thread synchronization during parallel execution, and is not due to other sequential code such as initialization.
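As a minimal sketch of how a distribution like the one in Figure 1 is obtained, the active thread count can be sampled periodically and binned into fraction-of-time buckets (the sampled data below is made up for illustration, not the paper's measurements):

```python
from collections import Counter

def thread_count_distribution(samples, total_threads=20):
    """Bin periodic samples of the active thread count into
    fraction-of-time buckets, as in a Figure 1-style histogram."""
    counts = Counter(samples)
    n = len(samples)
    return {k: counts.get(k, 0) / n for k in range(1, total_threads + 1)}

# Hypothetical samples of the active thread count over time (not measured data):
samples = [1, 1, 20, 20, 20, 5, 20, 1, 20, 20]
dist = thread_count_distribution(samples)
print(dist[20])  # -> 0.6: all 20 threads active 60% of the time
```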
Figure 1. Distribution of the number of active threads for the PARSEC benchmarks on a twenty-core processor.

2.2 Multi-core design choices

There exist three major multi-core architectures: symmetric or homogeneous, asymmetric or heterogeneous, and dynamic []. All cores in a homogeneous multi-core have the same organization; examples are the Intel Sandy Bridge CPU [], AMD Opteron [], IBM POWER7 [], etc. Each core typically implements Simultaneous Multi-Threading (SMT), effectively providing a many-thread architecture, e.g., an 8-core processor with 4 SMT threads per core effectively yields a 32-threaded processor. A heterogeneous (or asymmetric) multi-core features one or more cores that are more powerful than the others. In the case of a single-ISA heterogeneous multi-core, there are so-called big, high-performance cores and small, energy-efficient cores. NVidia's Kal-El [] integrates four performance-tuned cores along with one energy-tuned core, and ARM's big.LITTLE [8] combines a high-performance core with a low-energy core. A dynamic multi-core is able to combine a number of cores to boost performance of sequential code sections. Core fusion [, 7] dynamically morphs cores to form a bigger, more powerful core. Thread-level speculation and helper threads [9, 8], in which assist-threads running on other cores help speed up another thread, could also be viewed as a form of dynamic multi-core. Recently, Khubaib et al. [6] proposed MorphCore, a high-performance out-of-order core that can morph into a many-threaded in-order core when the demand for parallelism is high.

2.3 Goal of this paper

Given the background in workloads and the multi-core design space as just described, the following key question arises: How to best design a single-ISA multi-core processor in light of varying degrees of thread-level parallelism in contemporary workloads?
As mentioned in the introduction, all three design options can deal with varying numbers of active threads, one way or the other. A homogeneous multi-core can distribute the active threads across the various cores and only activate SMT when there are more active threads than cores. A heterogeneous multi-core can schedule the active threads on the big cores and only schedule threads on the small cores when there are more active threads than big cores. A dynamic multi-core can form as many cores as there are active threads. However, without a detailed and comprehensive study, it is unclear which multi-core architecture paradigm yields the best performance under varying active thread counts. This paper, to the best of our knowledge, is the first to explore this multi-core design space and comprehensively compare multi-core paradigms in light of variable active thread count. Note that specialized accelerators are not in this paper's scope, as we focus on single-ISA multi-cores.

3. Experimental Setup

3.1 Multi-core design space

To evaluate the various multi-core paradigms in the context of varying thread counts, we use the following experimental setup. We consider three types of cores: a four-wide out-of-order core (big core), a two-wide out-of-order core (medium core), and a two-wide in-order core (small core); see Table 1 for more details about these microarchitectures.

Table 1. Big, medium and small core configurations.

                      Big core           Medium core        Small core
Frequency             2.66 GHz           2.66 GHz           2.66 GHz
Type                  out-of-order       out-of-order       in-order
Width                 4                  2                  2
ROB size              128                32                 N/A
Func. units           int, ld/st,        int, ld/st,        int, ld/st,
                      mul/div, FP        mul/div, FP        mul/div, FP
SMT contexts          up to 6            up to 3            up to 2
L1 I-cache            32 KB, 4-way       16 KB, 4-way       6 KB, 4-way
L1 D-cache            32 KB, 4-way       16 KB, 4-way       6 KB, 4-way
L2 cache              256 KB, 8-way      128 KB, 4-way      48 KB, 4-way
Last-level cache      8 MB, 16-way, shared
On-chip interconn.    2.66 GHz, full crossbar
DRAM                  8 banks, 45 ns access time
Off-chip bus          8 GB/s

We compare all multi-core architectures under the (approximately) same power envelope.
We therefore estimate power consumption using McPAT [] (assuming aggressive clock gating). The big core consumes approximately 1.8 times the power of the medium two-wide OoO core on average, and roughly 4.5 times the power of the small two-wide in-order core. We conservatively assume that one big core is power-equivalent to two medium cores and five small cores. We validate later in this section that these scaling factors result in approximately equal power consumption, even when the big cores execute six threads through SMT (which leads to higher utilization and therefore higher dynamic power consumption). When evaluating energy efficiency in Section 7, we assume idle cores are power gated.
We keep total on-chip cache capacity constant when exploring the multi-core design space, in order to focus on the impact of core types and organization, and not cache capacity. This implies that we have to set the private cache sizes of the medium core two times smaller compared to the big core, and five times smaller for the small core; see also Table 1. (We pick numbers that are powers of two or just in between two powers of two.) The last-level cache (LLC) is shared across all cores, and has the same size for all multi-core configurations (8 MB). The on-chip network is a full crossbar between all cores and the shared LLC. Although not realistic, a full crossbar ensures that the results are not skewed in favor of the few-large-cores configurations, which would experience less contention in the on-chip network compared to a many-small-cores configuration. We use the multi-core simulator Sniper [] enhanced with cycle-level out-of-order and in-order core models, as well as SMT support.

The total chip power budget is equivalent to 4 big cores or 8 medium cores or 20 small cores, plus a shared LLC. This allows for 9 possible designs, see Figure 2. (For the heterogeneous designs, we only consider mixes of big cores and medium cores or small cores; we do not consider mixes of medium and small cores.) In the remainder of the paper, these designs are referred to as 4B, 3B2M, 2B4M, 1B6M, 8M, 3B5S, 2B10S, 1B15S and 20S, as indicated in the figure. 4B, 8M and 20S are homogeneous multi-cores (all cores of the same type), while the others are heterogeneous. With SMT enabled, we assume that a big core is able to execute up to six threads, a medium core can execute up to three threads, and a small in-order core can execute up to two threads (using fine-grained multithreading), so that all configurations can run up to 24 threads. The SMT core that we simulate implements static ROB partitioning and a round-robin fetch policy [].
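The nine power-equivalent designs follow mechanically from the stated equivalences (one big core equals two medium or five small cores, a budget of four big-core equivalents, and no medium/small mixes). A small sketch that enumerates them (the `label` helper and list ordering are ours):

```python
# Power cost in big-core equivalents: big = 1, medium = 1/2, small = 1/5.
# The chip budget is four big-core equivalents; heterogeneous designs mix
# big cores with either medium or small cores, never medium with small.
BUDGET = 4

def label(b, m, s):
    """Name a design the way the paper does, e.g. (2, 4, 0) -> '2B4M'."""
    return "".join(f"{n}{t}" for n, t in ((b, "B"), (m, "M"), (s, "S")) if n)

designs = []
for big in range(BUDGET, -1, -1):
    rest = BUDGET - big                    # budget left after the big cores
    if rest == 0:
        designs.append((big, 0, 0))        # the all-big design (4B)
    else:
        designs.append((big, rest * 2, 0)) # fill the rest with medium cores
        designs.append((big, 0, rest * 5)) # ...or with small cores
names = [label(*d) for d in designs]
print(names)
# -> ['4B', '3B2M', '3B5S', '2B4M', '2B10S', '1B6M', '1B15S', '8M', '20S']
```

Every enumerated design spends exactly the four-big-core budget, which is what makes the nine configurations directly comparable under one power envelope.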
The average total (static plus dynamic) power consumption of the three homogeneous configurations running 24 threads is similar for 4B, 8M and 20S (averaged across all homogeneous multi-program workloads, see later), and the power consumption of the heterogeneous configurations falls within the same range. This justifies our claim that all configurations operate more or less under the same power envelope.

3.2 Workloads

Multi-program workloads. We consider multi-program workloads using the SPEC CPU2006 benchmarks with their reference inputs. In order to limit the number of simulations, we select representative benchmark-input combinations. The selection is based on the relative performance of the benchmarks on the three core types. We evaluated all SPEC CPU2006 benchmark-input combinations on the three core designs (big, medium and small) and calculated relative performance with respect to the big core. We then picked the benchmarks that cover the full performance range, i.e., the benchmarks that have the highest and lowest relative performance, along with in-between benchmarks picked such as to provide good coverage.

Figure 2. The nine power-equivalent multi-core designs considered in this study (B=big core, M=medium core, S=small core).

For each benchmark, we take a 750 million instruction single simulation point to reduce simulation time [6]. When running a multi-program workload, we stop the simulation when all of the programs have executed at least 750 million instructions, thereby restarting programs that reached the end of the simulation point. We summarize multi-program performance using the system throughput (STP) metric [7], or weighted speedup [7], which is a measure for the number of jobs completed per unit of time. For computing STP, we normalize against isolated execution on the big core.
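For reference, STP and its per-program counterpart ANTT can be computed from per-program progress rates; a sketch of the standard definitions [7], with the single-program baseline taken on the big core as described above (variable names are ours):

```python
def stp(ipc_together, ipc_alone):
    """System throughput (weighted speedup): the sum of each program's
    progress rate normalized to its isolated execution."""
    return sum(t / a for t, a in zip(ipc_together, ipc_alone))

def antt(ipc_together, ipc_alone):
    """Average normalized turnaround time: the mean per-program slowdown
    (lower is better)."""
    slowdowns = [a / t for t, a in zip(ipc_together, ipc_alone)]
    return sum(slowdowns) / len(slowdowns)

# Two co-running programs, each achieving half of its isolated IPC:
print(stp([1.0, 0.5], [2.0, 1.0]))   # -> 1.0 (no net throughput gain)
print(antt([1.0, 0.5], [2.0, 1.0]))  # -> 2.0 (each program takes 2x longer)
```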
When reporting STP numbers averaged across a set of workloads, we use the harmonic mean because STP is a rate metric (inversely proportional to time). We also calculate the average normalized turnaround time (ANTT [7]) to show the impact of the multi-core design on per-program performance. We evaluate homogeneous multi-program workloads (multiple copies of the same benchmark) as well as heterogeneous multi-program workloads (different benchmarks co-run). We vary the number of programs from 1 to 24. For the heterogeneous multi-program workloads, we randomly construct two-, three-, four-, etc., up to twenty-four-thread combinations, while making sure that every benchmark is included an equal number of times for all thread counts. Velasquez et al. [] show that this balanced random sampling technique is more representative than fully random sampling.

We intentionally limit the number of active threads to 24 to reflect a (realistic) situation with a modest and variable thread count. Given the hardware budget of 4 big cores, this is already a considerable number of threads (6 threads per core). Our results confirm that at a (constantly) large thread count, a design with many small cores is optimal, but in this study we specifically target those workloads that exhibit a variable active thread count. Furthermore, we believe our results are general enough to be projected to larger hardware
budgets and thread counts (e.g., 8 large cores and up to 48 threads).

Scheduling also plays an important role in multi-program workload performance. A general principle that we maintain is to first schedule threads on the big core(s) in a heterogeneous design before scheduling on the small cores. Likewise for SMT, we first distribute threads across cores before engaging SMT, e.g., when there are fewer active threads than cores, we run each thread on a separate core, but when there are more active threads than cores, we need to co-run threads on a single core through SMT. A heterogeneous design also implies deciding which thread to execute on which core. Similarly, in the case of SMT, we need to decide which threads to co-run on a core, since different co-runner schedules may have a significant impact on performance [7]. As exploring all possible combinations of program schedules is infeasible because of simulation time considerations, we use offline analysis to determine the best possible schedule. We run each benchmark on each of the different core types in isolation, and use this analysis to steer application-to-core mapping for the heterogeneous design points for best performance. Likewise for SMT, we run all possible two-, three-, etc., up to six-program combinations on the big core (up to four for the medium and two for the small cores), and select the best possible co-schedule. This approach ignores the impact of resource sharing among cores when steering scheduling; however, we do account for resource sharing (shared cache, memory bandwidth, etc.) during detailed simulation for the selected schedules.

Multi-threaded workloads. We also evaluate multi-threaded workloads, using the PARSEC benchmarks []. We vary the number of threads from 4 to 24 in steps of 4. We only include the benchmarks that allow for a number of threads that is not fixed to a power of 2.
We use the medium-size input set for all benchmarks, and evaluate the execution time for the parallel part only (the so-called region of interest or ROI) and for the whole program (including the sequential initialization and finalization code). We report speedups versus a four-threaded execution on the 4B configuration.

4. Multi-Program Workloads

We now evaluate the performance of the nine multi-core designs for multi-program workloads, i.e., workloads consisting of multiple single-threaded programs. (We will discuss multi-threaded workloads in the next section.) We first discuss performance as a function of thread count, and subsequently compute aggregate performance under the assumption of various active thread count distributions.

4.1 Performance as a function of thread count

Figure 3 shows average performance for the nine multi-core configurations as a function of the number of threads, from 1 to 24. All designs have SMT enabled in all cores; the non-SMT curves can be reconstructed by leveling off performance as soon as thread count equals core count. The interesting observation is that the homogeneous 4B configuration performs well compared to the other homogeneous and heterogeneous designs. Although the heterogeneous designs outperform the 4B configuration for some thread counts, 4B performs well over the full range of thread counts. When thread count is low, 4B yields the highest performance, and when thread count is high, 4B yields only slightly lower performance compared to the many-medium-core and many-small-core designs (8M and 20S). It is not surprising that the 4B multi-core configuration performs well for low thread counts: for 4 or fewer threads, each powerful big core has only one thread running.
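The well-behaved low-thread-count regime follows directly from the scheduling policy of Section 3.2: big cores are filled first, and threads are spread across cores before SMT is engaged. A greedy sketch of that policy (the helper function and core encoding are ours, not the paper's):

```python
def assign_threads(n_threads, cores):
    """Greedy schedule following the paper's stated policy: prefer big
    cores (listed first), and spread threads across cores before using
    a second SMT context on any core.
    `cores` is a list of (type, smt_capacity) tuples, big cores first."""
    load = [0] * len(cores)
    for _ in range(n_threads):
        # Pick the least-loaded core that still has a free SMT context;
        # ties break toward big cores, which come first in the list.
        candidates = [i for i, (_, cap) in enumerate(cores) if load[i] < cap]
        best = min(candidates, key=lambda i: load[i])
        load[best] += 1
    return load

big4 = [("B", 6)] * 4                  # the 4B design: four 6-way SMT cores
print(assign_threads(3, big4))         # -> [1, 1, 1, 0]: no SMT needed
print(assign_threads(24, big4))        # -> [6, 6, 6, 6]: full 6-way SMT
```

With four or fewer threads on 4B, every thread gets a big core to itself, which is why this design matches the heterogeneous points at low thread counts.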
What is more remarkable is that the 4B SMT multi-core also performs relatively well when thread count is high: for example, when there are 24 threads, each core executes six threads concurrently, but performance is close to that of running 24 threads on 8 medium cores (each using three-way SMT) or 24 threads on 20 small cores (4 cores use two-way SMT, the others execute only one thread). To explain this behavior, Figure 4 shows the same graphs for two homogeneous workloads, which were picked to illustrate the interesting diversity observed across the various benchmarks; we found the benchmarks to roughly classify into these two categories. Tonto (left graph) shows the intuitively expected behavior: up to 8 threads, performance of the 4B SMT multi-core is better than or similar to the performance of the heterogeneous architectures, but beyond 8 threads, its performance is inferior. Tonto clearly benefits from the higher aggregate execution resources available in the heterogeneous design points, as well as in the homogeneous multi-cores with all medium or small cores, at high active thread counts. For libquantum (right graph), on the other hand, the 4B multi-core with SMT performs approximately as well as the other design points at high thread counts. What happens here is that as the number of threads increases, more and more pressure is put onto the shared resources (shared last-level cache, memory bandwidth, DRAM banks, etc.), up to the point that performance gets largely dominated by shared resource contention and less by individual core performance. In particular, we observe that, for libquantum, memory access time is many times higher for 24 threads than for one isolated thread, for both the 4B and 20S configurations, due to contention on the memory bus. This tightens the gap and flattens out the performance differences between the various multi-core configurations.
It is interesting to note that, at high thread counts, the performance of 4B relative to the other design points is slightly lower for the homogeneous workloads than for the heterogeneous workloads; compare graphs (a) versus (b) in Figure 3. In fact, for heterogeneous workloads and 24 threads, we notice that the performance of 4B is only about 7% lower than the maximum, while for homogeneous workloads, 4B's performance falls somewhat further below the maximum.

Figure 3. Comparing the performance of the nine multi-core design points with (a) homogeneous and (b) heterogeneous multi-program workloads.

Figure 4. Performance of the nine multi-core design points for two representative benchmarks (both homogeneous multi-program workloads): (a) tonto and (b) libquantum.

This is due to the fact that heterogeneous workloads consist of mixes of both memory-intensive and compute-intensive benchmarks. Scheduling a memory-intensive benchmark with compute-intensive benchmarks on one core using SMT enables the memory-intensive benchmark to occupy a larger fraction of the core's private cache (256 KB in our study), as the compute-intensive benchmarks are less demanding of cache space. In the case of a multi-core with many small cores (20S), each core has a small private cache (48 KB in our setup); hence, a memory-intensive benchmark would not get as much cache space. By intelligently scheduling benchmarks to cores and SMT thread contexts, the 4B multi-core is better capable of utilizing cache space than a multi-core with many small cores and relatively smaller private caches.

For completeness, Figure 5 shows the average normalized turnaround time (ANTT) for the homogeneous workloads as a function of thread count (the results for heterogeneous workloads are similar). At small thread counts, the 4B design results in the lowest per-program execution time (highest per-program performance), because all threads can run on a big core. Per-program execution time increases as thread count goes up, because more threads share a core through SMT, reducing per-program performance.

Figure 5. Comparing the ANTT for the nine multi-core design points with homogeneous multi-program workloads.

For the other extreme configuration, 20S, the turnaround time is larger at low thread counts, because of the poorly performing cores, but it remains more stable as thread count increases, due to a smaller degree of sharing. The conclusions are similar to those of the throughput results: at low thread counts, 4B has the highest throughput and the lowest per-program execution time; at high thread counts, the configurations with more and smaller cores have the highest throughput and the lowest per-program execution time, but the 4B configuration remains close.
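Aggregating these per-thread-count curves into a single number, as the next section does, means weighting the performance at each thread count by how often that count occurs. A minimal sketch of that aggregation (the performance curve below is made up for illustration; only the uniform weighting and the mirroring operation follow the text):

```python
N = 24  # thread counts 1..24

def weighted_perf(perf, dist):
    """Expected throughput under a thread-count distribution;
    perf[i] and dist[i] both refer to i+1 active threads."""
    assert abs(sum(dist) - 1.0) < 1e-9   # dist must be a probability mass
    return sum(p * w for p, w in zip(perf, dist))

uniform = [1.0 / N] * N                  # every thread count equally likely

def mirrored(dist):
    """Mirror a distribution around the center: n threads <-> 25 - n."""
    return dist[::-1]

# Hypothetical performance curve (linear scaling; not measured data):
perf = [0.5 * n for n in range(1, N + 1)]
print(weighted_perf(perf, uniform))
```

For rate metrics such as STP, the averaging across workloads is harmonic (as stated in Section 3.2); the expected-value weighting above only illustrates how a thread-count distribution enters the aggregate.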
Figure 6. Average performance assuming a uniform thread count distribution and no SMT.

We conclude this section with our first finding:

Finding #1: A homogeneous multi-core consisting of all big SMT cores yields better performance than a heterogeneous multi-core for a small number of threads (due to the bigger cores) and only slightly worse performance for a large number of threads (because shared resource contention largely dominates performance for workload mixes of memory-intensive applications, and cache capacity can be used more efficiently through intelligent scheduling).

4.2 Thread count distributions

We now compare the multi-core designs under various active thread distributions, assuming uniform distributions as well as distributions observed in datacenter operations.

4.2.1 Uniform distribution. We begin by assuming a uniform distribution over thread counts, i.e., each thread count (1 to 24 threads) has equal probability.

No SMT. We first assume that none of the cores implement SMT. Figure 6 shows the average performance for all of the multi-core designs without SMT. Each core can execute only one thread at a time, and when there are more threads than cores, multiple threads run on one core sequentially through time-sharing. Clearly, the 4B configuration outperforms the other homogeneous configurations (8M, 20S). Being able to execute faster at low thread counts is more important than achieving high throughput at high thread counts. This is in line with Amdahl's law: as parallelism increases, the performance of the sequential part (low thread count) dominates the performance of the program as a whole. The most important conclusion is that the optimal design without SMT is, for homogeneous workloads and for heterogeneous workloads alike, a heterogeneous multi-core design. Hence our second finding:

Finding #2: In the absence of SMT, heterogeneous multi-cores outperform homogeneous multi-cores across varying thread counts.
At low thread counts, the big cores in a heterogeneous multi-core can be used to get high performance, while at high thread counts, the larger number of small cores can be used to exploit thread-level parallelism. This is in line with recent work that advocates single-ISA heterogeneous multi-core processors [, 9].

Figure 7. Average performance assuming a uniform thread count distribution and SMT in the homogeneous configurations.

Figure 8. Average performance assuming a uniform thread count distribution and SMT in all configurations.

SMT in homogeneous designs. We now assume SMT is implemented in the homogeneous designs (4B, 8M and 20S), but not in the heterogeneous designs. Figure 7 shows average performance for the various designs. It is interesting to compare this graph against the one in Figure 6, which showed that heterogeneous multi-cores yield higher performance than homogeneous multi-cores when the number of threads varies. Now, through Figure 7, we observe that by adding SMT to the homogeneous multi-cores, the 4B design outperforms the other designs. This leads to:

Finding #3: A homogeneous multi-core with big SMT cores outperforms a heterogeneous multi-core (without SMT) under the same power budget. Put differently, SMT outperforms heterogeneity as a means to cope with varying thread counts.

The intuition is that, at low thread counts, the 4B design with SMT is able to use all big cores, while the number of big cores in the heterogeneous designs is always smaller. At high thread counts, a homogeneous multi-core with big SMT cores allows for more concurrent threads (24 in total) compared to heterogeneous multi-cores (at most 16 in the 1B15S design point), yielding higher overall throughput within the same power budget.

Figure 9. Average performance per benchmark assuming a uniform thread count distribution.

Figure 10. (a) Datacenter thread count distribution, and (b) average performance using the datacenter distribution and the mirrored datacenter distribution.

SMT in all designs. Finally, Figure 8 shows average performance when SMT is enabled in all cores of all designs. For homogeneous workloads, the performance of the best heterogeneous configuration is several percent higher than that of 4B when no configuration has SMT (Figure 6), but only marginally higher when SMT is enabled in all designs (Figure 8). For heterogeneous workloads, the homogeneous 4B design even outperforms the best heterogeneous design. Thus, in other words:

Finding #4: The added benefit of combining heterogeneity and SMT is limited.

It is also interesting to observe that, once SMT is added, the optimal heterogeneous design shifts towards configurations with fewer and larger cores, for both the homogeneous and the heterogeneous workloads. Hence:

Finding #5: Adding SMT to the heterogeneous designs makes the optimum shift towards fewer and larger cores.

This is in line with the general observation that SMT in larger cores enables flexibility as a function of active thread count.

Per-benchmark results. Figure 9 shows average performance for the various multi-core configurations (SMT enabled in all cores) for each benchmark, assuming a uniform distribution. The results vary across benchmarks: for some benchmarks (calculix, h264ref, hmmer and tonto), 4B performs worse than the best heterogeneous multi-core, while for others it performs similarly, or even slightly better (libquantum and mcf).
Detailed analysis of the results revealed that the latter category of benchmarks has high memory bandwidth demands, resulting in bandwidth-bound performance numbers at high thread counts. Section 8.2 contains results with a higher memory bandwidth setting.

Datacenter distributions. Figure 10(b) shows average performance across two different thread count distributions, assuming heterogeneous workload mixes. Datacenter is the distribution taken from [2] for CPU utilization in a datacenter, adapted to the maximum number of threads in our workloads; Figure 10(a) shows the distribution: there is a peak at one thread (low utilization) and one at 7 to 9 threads. Mirrored datacenter is the same distribution, mirrored around the center, which means that there now is a peak at the highest thread count, and one around 6 to 8 threads. We use this distribution to model a more heavily
loaded server park, with a distribution skewed to the higher thread counts.

[Figure 11. Average normalized speedup for all PARSEC benchmarks: (a) ROI only, (b) whole program.]

For the datacenter distribution, the best performing configuration without SMT (see Figure 10(b)) is as expected: it provides a big core for the peak at one thread, and 7 cores in total to cover the peak around 7 threads. Adding SMT again makes the configurations with fewer but bigger cores more optimal, with the best performance for the big-core configuration. For the mirrored datacenter distribution, the optimum without SMT is different, because there is a peak at 6 threads. For the configurations with SMT, the homogeneous big-core design performs only .6% worse than the optimum.

Finding #6: For distributions that are skewed to fewer threads, the homogeneous configuration with big SMT cores is optimal. For distributions that are skewed towards more active threads, it becomes less optimal, but its performance is very close to the optimum.

5. Multi-Threaded Workloads

As discussed in the introduction, multi-threaded programs can also have a variable number of active threads. When threads have to wait due to synchronization (e.g., a barrier), they can be scheduled out by the operating system to free resources for other runnable threads. Periods with a low active thread count are critical to performance, since they exhibit little parallelism and are therefore more difficult to speed up [6]. Achieving high performance at low thread counts is therefore likely to be even more crucial for multi-threaded workloads than for multi-program workloads. We use the PARSEC benchmarks in this section, and always report the maximum speedup across all possible thread counts. Note that the best thread count does not necessarily equal the total core count because of interference between threads in shared resources.
We further assume pinned scheduling, which pins threads to cores to improve data locality (as done in modern multi-core schedulers [13]); and we execute serial phases on the big core when reporting whole-program performance results. We limit the discussion in this section to heterogeneous designs with a single big core, since pinned scheduling does not enable benefiting from multiple big cores. (We verified that none of the other heterogeneous designs have larger speedups than the ones reported here.) The results, averaged across all benchmarks, are shown in Figure 11. We split up the results for the ROI only and the whole program, and show the speedups without SMT (i.e., the number of threads equals the number of cores) and with SMT. For the ROI-only results without SMT, the optimal design is the one whose core count matches how the applications scale: most of the applications scale well up to 8 threads, but not beyond. Adding SMT boosts the speedup for the big-core design, and makes its speedup very close to that of the optimum. Overall, the big-core design with SMT performs well for benchmarks that have poor parallelism, and performs only slightly worse for programs that scale well. For the whole-program results, the big-core design performs best both without and with SMT: it performs best for applications with limited parallelism, and close to optimal for applications that scale better but have large serial initialization and finalization phases. Without SMT, the heterogeneous designs perform close to the homogeneous configuration: they speed up the serial phases, but on average, the poorly scaling benchmarks achieve better performance on the homogeneous configuration and this dominates the average. With SMT enabled, the difference between the homogeneous configuration and the heterogeneous configurations is larger, because the homogeneous configuration with SMT speeds up well-scaling benchmarks more. Figure 12 shows per-benchmark speedups.
For the ROI-only results (top graph), it clearly shows the difference across benchmarks: the many-core configuration is optimal for well-scaling benchmarks, while the big-core configuration or a heterogeneous design is optimal for poorly scaling benchmarks. For the whole-program results (bottom), the optimal configuration is the big-core configuration or a heterogeneous design for most of the benchmarks.

Finding #7: SMT is also beneficial for multi-threaded workloads. As for the multi-program workloads, adding SMT shifts the optimal design towards fewer but larger cores. A homogeneous design with big SMT cores outperforms the best heterogeneous design without SMT, and performs close to, and sometimes even slightly better than, the best heterogeneous design with SMT.
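The multi-program evaluations above report average performance by weighting per-thread-count throughput with a thread count distribution (uniform, datacenter, mirrored). A minimal sketch of that metric follows; the throughput numbers, configuration names and distributions below are hypothetical placeholders, not the paper's measured data:

```python
# Sketch of the "average performance under a thread count distribution"
# metric. All numbers and configuration names are illustrative.

def expected_throughput(throughput_by_count, distribution):
    """Weight the throughput at each active thread count by the
    probability of that thread count occurring."""
    assert abs(sum(distribution.values()) - 1.0) < 1e-9
    return sum(p * throughput_by_count[t] for t, p in distribution.items())

# Hypothetical normalized throughput for two designs at 1..4 threads.
big_smt    = {1: 1.0, 2: 1.8, 3: 2.4, 4: 2.9}   # few big SMT cores
many_small = {1: 0.4, 2: 0.8, 3: 1.2, 4: 1.6}   # many small cores

uniform    = {t: 0.25 for t in range(1, 5)}
skewed_low = {1: 0.55, 2: 0.25, 3: 0.15, 4: 0.05}  # mostly few threads

for name, dist in [("uniform", uniform), ("skewed-low", skewed_low)]:
    print(name,
          round(expected_throughput(big_smt, dist), 3),
          round(expected_throughput(many_small, dist), 3))
```

With these illustrative numbers, the big-SMT-core design wins under both distributions, and its advantage grows as the distribution skews towards fewer active threads, mirroring the shape of the findings above.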
[Figure 12. Normalized speedup for the individual PARSEC benchmarks: (a) ROI only, (b) whole program.]

6. Dynamic Multi-Cores

Dynamic multi-cores are multi-core processors with a dynamic configuration [11, 17]: the core configuration and the number of cores can dynamically vary between many small cores and a few large cores, with heterogeneous configurations in between. Theoretical studies, such as the one by Hill and Marty [10], show that this type of multi-core is optimal in the context of varying parallelism and varying thread count. Through dynamic adaptation, one or a few big cores can be formed when there is low parallelism, while the configuration is changed to many small cores when there are many active threads. This technique is essentially the inverse of SMT: an SMT core executes a single thread at low active thread counts but can execute multiple threads at higher active thread counts; a dynamic multi-core executes threads on independent cores, which can be fused into bigger cores at low active thread counts. To compare the ability of a homogeneous multi-core with big SMT cores versus a dynamic multi-core to cope with varying active thread counts, we assume an ideal dynamic multi-core that can be morphed without overhead into any of the 9 multi-core configurations in Figure . This ideal dynamic multi-core chooses the best performing configuration (out of the 9 possible configurations) at each thread count for each workload. This is an optimistic assumption in favor of dynamic multi-cores, since fusing cores is likely to involve non-negligible time, area and power overhead. Figure 13 compares dynamic multi-cores (both with and without SMT) against the homogeneous big-SMT-core configuration for the homogeneous and heterogeneous multi-program workloads.

[Figure 13. Throughput as a function of the number of threads for the homogeneous configuration with SMT and the dynamic core fusion configuration with and without SMT: (a) homogeneous multi-program workloads, (b) heterogeneous multi-program workloads.]

This figure shows that dynamic multi-cores without SMT yield similar or even worse overall performance. Especially for heterogeneous workloads, SMT seems to perform better than a dynamic multi-core design. The reason is that SMT enables better utilization and higher throughput within the same power budget, especially when the programs are complementary in their resource demands. SMT also allows for more fine-grained parallelism: for the dynamic multi-core, a big core can be split up into medium cores or small cores, but an SMT core can also execute intermediate thread counts concurrently, while fully utilizing all resources. As a result, the SMT line in Figure 13(b) smoothly increases, while the dynamic line (without SMT) shows multiple plateaus with jumps when the configuration changes. A dynamic multi-core that also supports SMT performs best, but this will probably result in a very complex design and an even more complex scheduling and reconfiguration policy. We thus conclude:

Finding #8: Homogeneous multi-cores with big SMT cores outperform (or are at least competitive with) dynamic multi-cores as a way to cope with variable active thread counts. A combination of both is optimal, but is also the most complex, both with respect to design and run-time scheduling.
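The Hill-and-Marty-style reasoning referenced in this section can be made concrete. Below is a sketch of their Amdahl's-Law model for symmetric, asymmetric and dynamic multi-cores, using their usual assumption that a core built from r base-core equivalents (BCEs) delivers sqrt(r) single-thread performance; the parameter values in the example are illustrative, not taken from the paper:

```python
import math

def perf(r):
    # Hill & Marty's assumption: single-thread performance scales as
    # the square root of the resources invested in a core.
    return math.sqrt(r)

def symmetric(f, n, r):
    """n BCEs organized as n/r identical cores of r BCEs each;
    f is the parallel fraction of the program."""
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def asymmetric(f, n, r):
    """One big core of r BCEs plus n - r base cores; the serial phase
    runs on the big core, the parallel phase uses all cores."""
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

def dynamic(f, n, r):
    """Idealized: fuse resources into one big core for the serial
    phase, split into n base cores for the parallel phase."""
    return 1.0 / ((1 - f) / perf(r) + f / n)

# Illustrative point: 64 BCEs, a 16-BCE big core, 90% parallel code.
for model in (symmetric, asymmetric, dynamic):
    print(model.__name__, round(model(0.9, 64, 16), 2))
```

Under the model's assumptions (no SMT, software either fully serial or fully parallel), dynamic beats asymmetric beats symmetric, which the numbers above reproduce; the text explains why this ordering need not hold once active thread counts vary and SMT enters the picture.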
[Figure 14. Power consumption as a function of thread count for all configurations assuming power gating.]

7. Energy Efficiency

In the previous sections, we focused on performance under an equal total power budget. However, power gating can be used to turn off idle cores, resulting in lower power consumption at low active thread counts. Especially for the configurations with many medium or small cores, this may result in improved power/energy efficiency compared to the homogeneous configuration with a few big SMT cores.

Power consumption as a function of thread count. Figure 14 shows average power consumption for all configurations (all with SMT enabled in all cores) as a function of thread count when power gating unused cores (averaged across all homogeneous multi-program workloads). It is interesting to study power consumption along with performance as shown in Figure 13: the big-core configuration consumes the most power at low active thread counts while delivering the highest performance; the small-core configuration consumes the least power while delivering the poorest performance; on the other hand, at high thread counts, all configurations perform nearly as well while consuming similar levels of power. Figure 14 also shows that activating SMT contexts increases power consumption, due to the increase in resource utilization, but not as much as activating additional cores does (see for example the big-core configuration, where filling the SMT contexts increases power consumption far less than activating extra cores would). Note that the numbers for one thread (leftmost points) do not reflect the relative power difference between the big, medium and small cores (the power consumption for one active core is 7., . and 9.8 Watt, respectively). This is because the shared L3 cache and the main memory (DRAM) are active all the time, irrespective of the active thread count; these resources consume approximately 7 Watt.
The relative difference in power consumption for the three core types is reflected in the slopes of the big-, medium- and small-core configurations (the part of the curves that does not use SMT, i.e., with a thread count lower than or equal to the core count).

Pareto-optimal designs. Figure 15 shows the power and energy consumption as a function of performance for the heterogeneous multi-program workloads (assuming a uniform thread count distribution).

[Figure 15. Throughput versus power (top) and energy (bottom) consumption for heterogeneous multi-program workloads (assuming a uniform thread count distribution).]

There are several interesting observations to be made. First, the small-core configuration consumes the least power, but results in high energy consumption due to its poor performance. In other words, a configuration with many small cores is not energy-optimal. Second, the big-SMT-core configuration is the best performing, but also has higher power and energy consumption. Third, the Pareto-optimal frontier is populated with heterogeneous design points, along with the best-performance and lowest-power configurations, both for power versus performance (top graph in Figure 15) and for energy versus performance (bottom graph). In other words, heterogeneity trades off performance for power and energy consumption. The design point with the minimum energy-delay product (EDP) across all designs considered is a heterogeneous configuration, yet this heterogeneous design point improves EDP by as little as .% and .8% over the homogeneous big-SMT-core design point for the homogeneous and heterogeneous workloads, respectively. This leads to the following finding:

Finding #9: Heterogeneous multi-core designs, when power gating idle cores, yield (only) slightly better energy efficiency compared to homogeneous multi-cores with big SMT cores under variable active thread count conditions.
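A small sketch of how the Pareto frontier and the energy-delay product used above behave for such design points; all design names and numbers are hypothetical illustrations, not the paper's measurements:

```python
# Sketch: Pareto filtering and EDP over (throughput, power) design
# points. The design points below are invented for illustration.

def pareto_frontier(points):
    """Keep designs for which no other design has at least as much
    throughput and at most as much power, with one strictly better.
    points: list of (name, throughput, power)."""
    frontier = []
    for name, thr, pw in points:
        dominated = any(t2 >= thr and p2 <= pw and (t2 > thr or p2 < pw)
                        for _, t2, p2 in points)
        if not dominated:
            frontier.append(name)
    return frontier

def edp(throughput, power, work=1.0):
    """Energy-delay product for a fixed amount of work:
    delay = work / throughput, energy = power * delay."""
    delay = work / throughput
    return (power * delay) * delay

# Hypothetical design points: (name, normalized throughput, Watt).
designs = [("big-SMT", 1.00, 60.0),
           ("hetero",  0.95, 50.0),
           ("small",   0.60, 30.0),
           ("mid",     0.80, 55.0)]

print(pareto_frontier(designs))   # "mid" is dominated by "hetero"
print({n: round(edp(t, p), 1) for n, t, p in designs})
```

With these invented numbers, the heterogeneous point has the lowest EDP but only slightly below the big-SMT point, echoing the shape of Finding #9.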
[Figure 16. Average multi-threaded benchmark performance with alternative large-cache and high-frequency configurations.]

8. Alternative Multi-Core Designs

8.1 Larger caches or higher frequency for the small cores

In our experimental setup, we made particular design decisions that may impact the final results. One decision was to keep the total cache capacity constant across all designs. The motivation was to evaluate the impact of core type and organization, not cache capacity. Nevertheless, we noticed that sharing a cache between multiple programs co-executing on an SMT core can lead to better cache usage. We therefore now evaluate the effect of keeping private cache sizes constant across core types. We also evaluate the impact of increasing the frequency of the small cores to improve their performance. Figure 16 shows the average speedup for the multi-threaded benchmarks (ROI only). The 6 lc and 6s lc configurations (lc stands for larger cache) are configurations where the private L1 and L2 cache sizes for the medium and small cores are equal to those of the big core. Larger caches consume more power, leading to a different power equivalence among core types: a big core is now power-equivalent to fewer medium cores and fewer small cores, which explains the decreased core count for the configurations with a larger cache. Further, the 6 hf and 6s hf configurations contain 6 medium cores or 6 small cores with the clock frequency increased from .66 GHz to . GHz. This increase in frequency likewise changes the power equivalence between the big and medium cores, and between the big and small cores. The results in Figure 16 show that a larger cache and, more distinctly, a higher frequency lead to a higher speedup for the small-core configuration (compare 6s lc and 6s hf against the baseline small-core configuration). This is because many benchmarks do not scale well up to the maximum thread count, and reducing the core count in exchange for more cache capacity or a higher frequency results in a higher speedup.
For the medium-core configuration, on the other hand, enlarging the cache or increasing the frequency has a negative impact on performance: the benefits of a larger cache or a higher frequency do not compensate for the reduction in core count. Overall, we observe that a homogeneous multi-core with big SMT cores achieves the best performance for the given power budget. Hence, we conclude that:

Finding #10: Enlarging the caches or increasing the frequency of the medium and small cores does not affect the general observation that a homogeneous multi-core with big SMT cores is close to optimal.

8.2 Higher memory bandwidth

Another decision made in our initial setup was to set the memory bandwidth to 8 GB/s. However, as mentioned before, for some benchmarks, memory bandwidth turns out to be a bottleneck. We therefore now double the memory bandwidth to 16 GB/s, see Figure 17. Comparing this figure to Figures 8 and 11, we observe that performance increases for all configurations, albeit by a small margin. For the homogeneous multi-program workloads, the homogeneous big-SMT-core design now achieves a .8% lower throughput than the optimum (which was .6% for 8 GB/s), and a .% lower throughput for the heterogeneous multi-program workloads (where it used to be .% higher). For the multi-threaded programs, considering the ROI only, we observe a speedup for this design that is .9% lower than the optimum (which was .8% before), and a .8% higher speedup when considering the whole program (.9% before). The programs that were bandwidth-bound in the 8 GB/s setup now achieve better performance across all configurations. These memory-bound benchmarks especially benefit from SMT, more so than compute-bound programs []. Hence, our conclusion:

Finding #11: Even under high available memory bandwidth, the performance of a homogeneous design with big SMT cores remains close to that of the heterogeneous configurations.

9. Related Work

Olukotun et al. [23] make the case for multi-core processing.
By comparing an aggressive single-core processor (6-wide out-of-order) and a dual-core processor consisting of narrower out-of-order cores, they found that parallelized applications with limited parallelism achieve comparable performance on both architectures, and that applications with a large amount of coarse-grained parallelism achieve significantly better performance on the dual-core. Kumar et al. [19] argue that a single-ISA heterogeneous multi-core processor covers a spectrum of workloads better than a conventional multi-core processor, providing good single-thread performance when thread-level parallelism is low, and high throughput when thread-level parallelism is high. Our results confirm this finding: the heterogeneous multi-core configurations achieve better overall performance than a homogeneous multi-core across the broad range of active thread counts when SMT is not enabled. However, Kumar et al. did not consider and compare against a homogeneous multi-core with big SMT cores, which we find to achieve a level of performance that is competitive with a heterogeneous design under varying degrees of thread-level parallelism, while being less costly to design and verify.

[Figure 17. Performance numbers assuming 16 GB/s memory bandwidth: (a) multi-program workloads, (b) multi-threaded workloads.]

Ipek et al. [11] and Kim et al. [17] propose to fuse small cores to form bigger cores when there are few active threads. By doing so, the multi-core processor becomes more dynamic and can more easily adapt to software diversity. Our results indicate that similar performance benefits can be achieved through the opposite mechanism: instead of fusing small cores to form a big core when there are few active threads, one could schedule threads across big SMT cores (and have few active SMT threads per core) in a homogeneous multi-core. Khubaib et al. [16] build on a similar insight when proposing MorphCore, an aggressive out-of-order core with SMT that can morph into an energy-efficient 8-way SMT in-order core. The idea is to switch between the two modes of operation depending on the amount of available thread-level parallelism: with few (one or two) active threads, the core runs in out-of-order mode, and it switches to in-order SMT with more active threads. Whereas Khubaib et al. focus on an energy-efficient core design that can switch between out-of-order and wide-SMT in-order operation, the focus of our work is to study the impact of variable thread-level parallelism in the workload, and how this affects multi-core design decisions. More specifically, we consider distributions of active thread counts in multi-program workloads next to multi-threaded workloads to compare homogeneous multi-cores with SMT against heterogeneous and dynamic multi-cores. MorphCore is complementary to our work and can be leveraged to further improve the energy efficiency of the big SMT cores when running multiple SMT threads.
Hill and Marty [10] evaluate the three major multi-core processor architecture paradigms (homogeneous, heterogeneous and dynamic multi-cores) and derive high-level insights from Amdahl's Law. Their model did not consider SMT and assumed that software is either sequential or infinitely parallel. One of their results is that heterogeneous multi-cores can achieve better performance than homogeneous multi-cores; they also find that dynamic processors achieve better performance than heterogeneous multi-cores, assuming identical functions of performance per unit area. While this is true under the assumptions made, our results show that it is not necessarily the case when the number of available threads varies over time. A number of papers have explored how to take advantage of heterogeneity to improve multi-threaded application performance. Annavaram et al. [1] propose running sequential portions of a multi-threaded application at a higher power budget, thereby significantly improving performance while remaining within a given power budget. Intel's TurboBoost [25] offers similar functionality by boosting the clock frequency. Suleman et al. [29] accelerate the execution of critical sections by exploiting high-performance cores in a heterogeneous multi-core, i.e., a thread that executes a critical section is migrated to a big core in order to reduce serialization time, a technique called Accelerating Critical Sections (ACS). Joao et al. [12] generalize this principle to other types of synchronization bottlenecks, including critical sections, barriers and pipes. All of these approaches exploit the fact that the number of active threads varies over time and leverage heterogeneity to improve performance. This paper suggests that similar performance benefits might potentially be achieved through SMT on a homogeneous multi-core.
More specifically, when a thread is executing sequential code (e.g., initialization, a critical section, etc.), scheduling it on a single core with the other SMT threads throttled might achieve similar performance benefits, and does not require migrating (or marshaling [30]) data when a thread is migrated from a small to a big core as in ACS. Li et al. [21] compare the energy efficiency and thermal characteristics of SMT versus multi-core. They report that, assuming an equal area budget, SMT is more energy-efficient than multi-core for memory-intensive workloads; the inverse is true for compute-intensive workloads. Kumar et al. [18] exploit dynamic time-varying application behavior to schedule applications on the most energy-efficient core in a heterogeneous multi-core, and they report substantial energy savings compared to a homogeneous multi-core. In contrast to this prior work, we explore multi-core configurations under variable thread-level parallelism conditions.

10. Conclusion

The number of active threads varies over time in today's computer systems. This has been observed across many application domains, ranging from multi-program systems and desktop applications to datacenter servers and even multi-threaded applications. This paper studied how varying degrees of thread-level parallelism in the workload affect multi-core design decisions. We considered homogeneous, heterogeneous and dynamic multi-cores under an equal power budget, and conclude that a homogeneous multi-core consisting of big SMT cores achieves comparable or slightly better performance compared to heterogeneous multi-cores (both with and without SMT) and dynamic multi-cores. The reason is that a homogeneous multi-core with big SMT cores can better adapt to varying degrees of thread-level parallelism in the workload: it achieves higher per-thread performance at low active thread counts and competitive throughput at high active thread counts. Finally, we also find that heterogeneous multi-cores are (only) slightly more energy-efficient than a homogeneous all-big-core configuration with SMT when power gating idle cores. The overall conclusion is that, while multi-cores with many small cores, be it homogeneous or heterogeneous architectures, outperform homogeneous multi-cores with big SMT cores at full utilization, the inverse is typically true under variable active thread conditions, which makes homogeneous multi-cores with big SMT cores an appealing, cost-effective design point for the variable active thread workloads commonly observed in modern-day systems.

Acknowledgments

We thank the anonymous reviewers for their valuable and constructive feedback. Stijn Eyerman is a postdoctoral fellow of the Research Foundation Flanders.
Additional support is provided by the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013) / ERC Grant agreement no. 99. Experiments were run at the VSC Flemish Supercomputer Center.

References

[1] M. Annavaram, E. Grochowski, and J. Shen. Mitigating Amdahl's law through EPI throttling. In Proceedings of the International Symposium on Computer Architecture (ISCA), June 2005.

[2] L. A. Barroso and U. Hölzle. The case for energy-proportional systems. IEEE Computer, Dec. 2007.

[3] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), Oct. 2008.

[4] G. Blake, R. G. Dreslinski, T. N. Mudge, and K. Flautner. Evolution of thread-level parallelism in desktop applications. In Proceedings of the International Symposium on Computer Architecture (ISCA), June 2010.

[5] T. E. Carlson, W. Heirman, and L. Eeckhout. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Nov. 2011.

[6] K. Du Bois, S. Eyerman, J. Sartor, and L. Eeckhout. Criticality stacks: Identifying critical threads in parallel programs using synchronization behavior. In Proceedings of the International Symposium on Computer Architecture (ISCA), June 2013.

[7] S. Eyerman and L. Eeckhout. System-level performance metrics for multi-program workloads. IEEE Micro, 28(3), May/June 2008.

[8] P. Greenhalgh. big.LITTLE processing with ARM Cortex-A15 & Cortex-A7: Improving energy efficiency in high-performance mobile platforms. http://www.arm.com/files/downloads/big_LITTLE_Final_Final.pdf, Sept. 2011.

[9] L. Hammond, M. Willey, and K. Olukotun. Data speculation support for a chip multiprocessor. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Oct. 1998.

[10] M. D. Hill and M. R. Marty. Amdahl's law in the multicore era. IEEE Computer, 41(7), July 2008.

[11] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez. Core fusion: Accommodating software diversity in chip multiprocessors. In Proceedings of the International Symposium on Computer Architecture (ISCA), June 2007.

[12] J. A. Joao, M. A. Suleman, O. Mutlu, and Y. N. Patt. Bottleneck identification and scheduling in multithreaded applications. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Mar. 2012.

[13] M. T. Jones. Inside the Linux scheduler: The latest version of this all-important kernel component improves scalability. http://www.ibm.com/developerworks/linux/library/l-scheduler/index.html, June 2006.

[14] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd. Power7: IBM's next-generation server processor. IEEE Micro, March/April 2010.

[15] C. N. Keltcher, K. J. McGrath, A. Ahmed, and P. Conway. The AMD Opteron processor for multiprocessor servers. IEEE Micro, Mar. 2003.

[16] Khubaib, M. A. Suleman, M. Hashemi, C. Wilkerson, and Y. N. Patt. MorphCore: An energy-efficient microarchitecture for high performance ILP and high throughput TLP. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2012.
[17] C. Kim, S. Sethumadhavan, M. S. Govindan, N. Ranganathan, D. Gulati, D. Burger, and S. Keckler. Composable lightweight processors. In Proceedings of the International Symposium on Microarchitecture (MICRO), Dec. 2007.

[18] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction. In Proceedings of the ACM/IEEE Annual International Symposium on Microarchitecture (MICRO), Dec. 2003.

[19] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas. Single-ISA heterogeneous multi-core architectures for multithreaded workload performance. In Proceedings of the International Symposium on Computer Architecture (ISCA), June 2004.

[20] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2009.

[21] Y. Li, D. Brooks, Z. Hu, and K. Skadron. Performance, energy, and thermal considerations for SMT and CMP architectures. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), Feb. 2005.

[22] NVidia. Variable SMP: a multi-core CPU architecture for low power and high performance. http://www.nvidia.com/content/pdf/tegra_white_papers/Variable-SMP-A-Multi-Core-CPU-Architecture-for-Low-Power-and-High-Performance-v1.1.pdf, 2011.

[23] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K.-Y. Chang. The case for a single-chip multiprocessor. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Oct. 1996.

[24] S. E. Raasch and S. K. Reinhardt. The impact of resource partitioning on SMT processors. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), Sept. 2003.

[25] E. Rotem, A. Naveh, D. Rajwan, A. Ananthakrishnan, and E. Weissmann. Power-management architecture of the Intel microarchitecture code-named Sandy Bridge. IEEE Micro, March/April 2012.

[26] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Oct. 2002.

[27] A. Snavely and D. M. Tullsen. Symbiotic jobscheduling for a simultaneous multithreading processor. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Nov. 2000.

[28] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar processors. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA), June 1995.

[29] M. A. Suleman, O. Mutlu, M. K. Qureshi, and Y. N. Patt. Accelerating critical section execution with asymmetric multi-core architectures. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Mar. 2009.

[30] M. A. Suleman, O. Mutlu, J. A. Joao, Khubaib, and Y. N. Patt. Data marshaling for multi-core architectures. In Proceedings of the International Symposium on Computer Architecture (ISCA), June 2010.

[31] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor. In Proceedings of the 23rd Annual International Symposium on Computer Architecture (ISCA), May 1996.

[32] R. Velasquez, P. Michaud, and A. Seznec. Selecting benchmark combinations for the evaluation of multicore throughput. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr. 2013.