Dynamic Scheduling Issues in SMT Architectures


Chulho Shin, System Design Technology Laboratory, Samsung Electronics Corporation
Seong-Won Lee, Dept. of Electrical Engineering - Systems, University of Southern California, seongwon@usc.edu
Jean-Luc Gaudiot, Dept. of Electrical Engineering and Computer Science, University of California, Irvine, gaudiot@uci.edu

Abstract

Simultaneous Multithreading (SMT) attempts to attain higher processor utilization by allowing instructions from multiple independent threads to coexist in a processor and compete for shared resources. Previous studies have shown, however, that its throughput may be limited by the number of threads. One reason is that a fixed thread scheduling policy cannot be optimal for the varying mixes of threads an SMT processor may face. Our Adaptive Dynamic Thread Scheduling (ADTS) was previously proposed to achieve higher utilization by allowing a detector thread to make use of wasted pipeline slots at nominal hardware and software cost. The detector thread adaptively switches between various fetch policies. Our previous study showed that a single fixed thread scheduling policy leaves much room (some 30%) for improvement compared to an oracle-scheduled case. In this paper, we take a closer look at ADTS. We implemented the functional model of ADTS and its software architecture to evaluate various heuristics for determining a better fetch policy for the next scheduling quantum. We report that performance could be improved by as much as 25%.

(The material reported in this paper is based upon work supported in part by the National Science Foundation under Grants No. CSA and INT. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.)

1. Introduction

Simultaneous Multithreading (SMT) or Multithreaded Superscalar Architectures [4, 10, 21, 20, 5, 8] can achieve high processor utilization by allowing multiple independent threads to coexist in the processor pipeline and share resources with the support of multiple hardware contexts. SMT is an attempt to overcome the low resource utilization of wide-issue single-threaded superscalar processors by exploiting Thread-Level Parallelism (TLP) at a relatively low hardware cost for supporting the multiple hardware contexts. Studies by Tullsen et al. and Ungerer et al. [21, 16] have shown that when the number of threads simultaneously active in an SMT processor becomes greater than four, performance often saturates and in some cases even degrades. In these studies, an attempt was made to overcome the saturation effect by finding a better fetch mechanism or by increasing the number and availability of resources that would otherwise become bottlenecks (such as register files and instruction queues). It was also shown that increasing the size of the caches can result in a higher saturation point. Unfortunately, such remedies do not work in all cases because their effectiveness is heavily affected by the properties of the application mixtures. We believe that one fixed thread scheduling policy which performs better than others on the average cannot deliver the performance we anticipate in SMT processors with more than four thread contexts. We will show that with our adaptive dynamic thread scheduling policy [15], we can significantly improve the performance of SMT processors and prevent the saturation and degradation effects alluded to earlier.
Our work focuses on multiprogrammed or multi-user environments, where the combinations of threads that an SMT processor faces vary significantly over time. For multiprogramming or multi-user workloads consisting of threads running on the processor independently of one another, no information about any interactive behavior between threads may be known in advance. Consequently, a more intelligent and more dynamic thread scheduling capability is indispensable if we are to sustain high throughput. When parallelizing an application to generate multiple threads, the role of thread scheduling is to eliminate resource conflicts and avoid data dependencies in order to expose more parallelism. In contrast, the role of scheduling multiple independent threads (of multiprogrammed workloads) is to perform better traffic control so as to sustain higher throughput by maintaining low interference between threads. Tullsen et al. [20] evaluated several fetch policies and showed that the ICOUNT policy yields the best average performance. ICOUNT gives priority to the threads with fewer instructions in the decode stage, the rename stage, and the instruction queues. Indeed, ICOUNT best accounts for what is taking place in SMT pipelines in general: since it gives priority to the threads that have fewer instructions in the earlier stages of the pipeline, the instruction window is used in a balanced way, and since it gives more opportunities to the threads whose instructions drain through the pipeline more rapidly, the pipeline is used more efficiently. While ICOUNT is the scheduling policy that works best on the average, it does not address problems as directly as other policies such as BRCOUNT and MISSCOUNT do. (BRCOUNT prioritizes threads with fewer conditional branches; see Section 5 for definitions of the various fetch policies.) Assume for example that the set of applications

in an SMT processor consists of four control-intensive applications (with many conditional branches) and four other applications. Further assume that these four control-intensive applications are experiencing many branch mispredictions at the moment. The processor will then suffer from wasted slots filled with wrong-path instructions of the four control-intensive applications, while the other four threads are prevented from exploiting the resources in the pipeline. In this specific case, if BRCOUNT had been used, the four control-intensive threads would have found fewer chances to get fetched. Consequently, the number of fetched instructions from the control-intensive threads would diminish while the number of instructions from the other four threads would increase, evening out the number of effective instructions among all threads.

The main goal of a hardware thread scheduler is to avoid imbalance among threads, where imbalance on a resource means that the usage or counts of the resource are not even among the threads. For example, if one thread has many more instructions in the early stages of the pipeline (the decode and rename stages and the instruction queue) than the others do, we have an imbalance in terms of instruction count. Imbalance adversely affects throughput, resulting in lowered Thread-Level Parallelism, for the following reasons: since a small number of threads occupy one type of resource, the other threads cannot have access to those same resources; the average number of non-dependent, issuable instructions per thread then becomes lower for the other threads, lowering the average number of instructions that can proceed through the pipeline.

With adaptive dynamic thread scheduling, when a change in the system environment is detected, the fetch policy which should be used during the next interval is decided upon and put into effect to eliminate the problematic imbalance. However, having multiple fetch policies and decision-making algorithms in hardware could translate into high hardware complexity. In our previous work [15], we proposed our detector thread approach, which helps lower the hardware requirements and makes use of unused pipeline slots to run the decision-making algorithms and fetch policies. Our approach also has the advantage that thread scheduling can be manipulated even after the chip has been produced, because the detector thread is programmable. The detector thread can also help lower the overhead of the system job scheduler by shortening its stay in the processor and by analyzing information before the job scheduler needs it.

In this paper, we take a closer look at the software aspect of ADTS. We propose an effective software architecture for the detector thread. The core of this software is the set of heuristics for determining the fetch policy to be used in the next scheduling quantum. We implement and evaluate functional models of those heuristics.

This paper is organized as follows. In section 2, previous work related to ours is summarized. The adaptive dynamic thread scheduling is reviewed in section 3, its software architecture is discussed in section 4, and how we evaluate our idea is discussed in section 5. Results of our simulation experiments are presented and analyzed in section 6. Summary and conclusions appear in section 7.

2. Related Work

Wang et al. investigated the use of a special thread aimed at realizing speculative precomputation in one of the two threads available on the Hyper-Threading architecture [22].
The study is targeted at improving the performance of single-threaded applications on two-context SMT processors. DanSoft [6] proposed the idea of nanothreads, in which one nanothread is given control of the processor upon the stall of a main thread. The idea was based on a CMP with dual VLIW single-threaded cores, and its success hinges on the effectiveness of the compiler. Assisted Execution [18] extended the nanothread idea to architectures that allow simultaneous execution of multiple threads, including SMT. It attempts to improve the performance of a main thread by having multiple nanothreads perform prefetching, and its success also hinges on the operation of the compiler. Speculative data-driven multithreading [14] takes advantage of a speculative thread, called a data-driven thread (DDT), to pre-execute critical computations and consume latency on behalf of the main thread on SMT. This study also focused on improving the performance of a main thread. Luk [11] likewise proposed pre-execution for more effective prefetching of hard-to-predict data addresses, using idle threads to boost the performance of a primary thread. Simultaneous Subordinate Microthreading (SSMT) [3] was proposed in an attempt to improve the performance of a single thread by having multiple subordinate microthreads perform useful work such as running sophisticated branch prediction algorithms. The idea was not based on an SMT architecture and also requires effective compiler technology.

Parekh et al. [13] investigated issues related to job scheduling for SMT processors. They compared the performance of oblivious and thread-sensitive scheduling. Oblivious scheduling means round-robin and random scheduling, while thread-sensitive scheduling takes into account the resource demands and behavior of each thread. The study concluded that thread-sensitive IPC-based scheduling can achieve a significant speedup over round-robin methods. However, this study concerns system job scheduling and cannot be directly related to dynamic thread scheduling. Also, the job scheduler has to be brought into the processor, resulting in a context switch of user threads. This job scheduler, however, can take advantage of our detector thread approach, as discussed in Section 3. Another similar study [17] investigated job scheduling for SMT processors. It proposed a job scheduling scheme called SOS, which involves an overhead-free sample phase in which the performance of various schedules (mixes) is sampled and taken into account when selecting tasks for the next time slice. We recognize that this strategy can also benefit from our approach because the detector thread is always active: it could make use of unused pipeline slots and resources to find out which threads should not be selected in the next job scheduling time slice, lowering the burden on the job scheduler. Our adaptive dynamic thread scheduling approach [15] should not be confused with adaptive process scheduling [12], which addresses O/S job scheduling issues for SMT processors: the goal of our approach is to offer more efficient thread scheduling at the individual instruction level in the SMT pipeline.

Suh et al. [19] examined approaches to detecting per-thread cache behavior on SMT using hardware counters and to helping job scheduling based on the information obtained. This approach is similar to our idea of relating the detector thread to job schedulers. However, it does not aim at controlling thread fetch policies.

3. Adaptive Dynamic Thread Scheduling (ADTS) with a Detector Thread (DT)

Our Adaptive Dynamic Thread Scheduling (ADTS) was introduced and discussed in detail in [15], where its implementation with a detector thread (DT) was also discussed. ADTS with a DT tackles two problems: first, a new fetch policy can be activated if the system is suffering from low throughput; second, it allows unused pipeline slots to be used to detect adverse changes in the system, identify threads that clog the pipeline, and take the actions needed to sustain high throughput. The actions that can be taken include context-switching a thread and preventing a specific thread from being fetched. A detector thread is a special thread which reads the thread status indicators and updates the thread control flags based on the current values of the indicators, so that the thread control hardware can take any action necessary to improve the performance of an SMT processor. The per-thread status indicators are updated by circuitry located throughout the processor pipeline, based upon specific events such as cache misses, pipeline stalls, population at each stage, etc.

[Figure 1. How a Detector Thread works with normal threads: per-thread counters for threads A through H and the DT feed the detector thread, which updates the thread control flags read by the thread selection units.]

Our previous work [15] proposed a way to implement the detector thread based on another study [3]. The detector thread has its own program cache, sufficiently large (2 or 4 KB) to fit its small program image, and its data accesses should be mostly to special registers such as the per-thread counters and to general-purpose registers. Most of the time, the detector thread is the lowest-priority thread. When the slots are almost fully occupied by normal threads, the detector thread does not obtain any more scheduling slots; this is acceptable because it means that the processor pipeline slots are enjoying high utilization. Fetching the detector thread's instructions should not result in significant overhead either: since its instructions come from its own isolated program cache, they do not compete for fetch bandwidth with the normal threads. It should not affect data memory bandwidth either, because its data mostly comes from special registers. Also, it was shown that the detector thread's job can fit within the cycle budget allowed in realistic situations [15].

The detector thread plays a major role in this process, as shown in Figure 1. It keeps watching the per-thread status indicators and updates the flags based on its active policy. The indicators are updated by hardware on predetermined events at places spread across the pipeline. The detector thread has the lowest priority among threads, so as long as the pipeline is well utilized, the detector thread is not often activated. Can a detector thread experience starvation in such cases? This depends upon the occupancy rate of the instruction fetch buffer: as long as the instruction fetch buffer is full, no instructions from the detector thread can be fetched. For this detector thread approach to work successfully, it has to be equipped with intelligent heuristics or algorithms to dynamically detect clogging (low throughput) and to choose a better fetch policy for the next time frame.
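To make the interface concrete, the per-thread status indicators described above can be pictured as a small bank of hardware counters exposed to the detector thread as special registers. The following C sketch is only an illustration under assumptions of ours: the field names and the flag encoding are hypothetical stand-ins for the hardware interface, not a layout defined in the paper.

```c
#include <stdint.h>

/* Hypothetical layout of the hardware-maintained per-thread status
 * indicators (one instance per hardware context, updated every cycle). */
typedef struct {
    uint32_t insn_count;      /* instructions in decode/rename/issue queues */
    uint32_t branch_count;    /* conditional branches seen this quantum     */
    uint32_t br_miss_count;   /* branch mispredictions this quantum         */
    uint32_t l1_miss_count;   /* L1 cache misses this quantum               */
    uint32_t lsq_full_count;  /* cycles the load/store queue was full       */
    uint32_t committed;       /* instructions committed this quantum        */
} thread_status_t;

/* Hypothetical per-thread control flags written back by the detector
 * thread and consulted by the thread selection unit each cycle. */
enum {
    FLAG_FETCH_ENABLE   = 1u << 0, /* thread may be fetched next cycle      */
    FLAG_CONTEXT_SWITCH = 1u << 1, /* suspend at the next opportunity       */
};
```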
However, since the resources allowed for the detector thread are quite limited in order to minimize hardware overhead, the algorithm is also limited in the data to which it can refer. This will be the topic of the next section.

4. Software Architecture of the Detector Thread

The role of the detector thread is to check the values of the various thread status indicators and, based on conditions dynamically defined in software, to properly update the thread control flags, as shown in Figure 1. Each thread has its own set of flags. One flag may tell whether a thread can be fetched in the next cycle, while another may tell whether it should be context-switched at the next opportunity. When the system thread is loaded, it will look at the flags and suspend a clogging thread without going through the process of determining which thread to suspend. The thread selection unit then simply issues instructions from threads in their order of priority. Although the per-thread status indicators, thread control flags, and thread selection units are fixed in hardware, we can control the thread scheduling behavior built around those hardware resources by writing different program code for the detector thread.

The software architecture of the detector thread for adaptive thread scheduling is shown in Figure 2. The status counters are updated at each cycle throughout the pipeline. For every period of 8K cycles, the number of committed instructions is counted, as is the maximum number of instructions that could have been executed (8K x 8). If the interval is to remain constant, the maximum need not be counted. The detector thread then checks whether the IPC (the number of committed instructions per cycle) is less than the threshold; if so, the previous time frame is identified as low-throughput. Once a previous scheduling quantum(2) is determined to be low-throughput, a new fetch policy has to be determined, because the incumbent policy (the one currently engaged) has turned out to perform poorly. The policy chosen to replace the incumbent policy for the next scheduling interval is then activated.

(2) This scheduling quantum should not be confused with that of the job scheduler. A typical quantum for job scheduling is in the range of milliseconds, which can be equivalent to a million cycles.
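To make the quantum check concrete, here is a minimal C sketch of the low-throughput test, under assumptions of ours: the read_special() accessor and the counter ID are hypothetical stand-ins for the special-register interface, and the integer scaling of the threshold is our choice for a detector thread that avoids floating point.

```c
#include <stdint.h>

#define QUANTUM_CYCLES 8192           /* 8K-cycle scheduling quantum        */
#define MAX_ISSUE      8              /* 8-wide machine: 8K x 8 upper bound */

/* Hypothetical special-register read; stands in for the hardware
 * counter interface described in the text. */
extern uint64_t read_special(int counter_id);
#define CTR_COMMITTED_TOTAL 0

/* Returns 1 if the quantum that just ended was low-throughput. The IPC
 * comparison is done in integer arithmetic:
 *   committed / QUANTUM_CYCLES < thold  <=>  committed * 100 < thold_x100 * QUANTUM_CYCLES */
static int quantum_was_low_throughput(uint64_t *prev_committed,
                                      uint32_t ipc_thold_x100)
{
    uint64_t total = read_special(CTR_COMMITTED_TOTAL);
    uint64_t committed = total - *prev_committed;   /* this quantum only */
    *prev_committed = total;
    return committed * 100 < (uint64_t)ipc_thold_x100 * QUANTUM_CYCLES;
}
```

A threshold of, say, IPC_thold = 2 would be passed as ipc_thold_x100 = 200 in this encoding.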

In the meantime, during the remaining idle slots, other functions can be accomplished. The first is to identify the clogging threads: by looking at the per-thread status counters, the threads that are clogging the pipeline for various reasons can be identified and marked, so that the job scheduler, once loaded, can suspend them without going through the possibly long process of identifying them for itself. This results in a shorter period of activity for the job scheduler. The second is to enforce the incumbent policy: the per-thread status counters are checked and the priority array is updated depending on the values of the counters. The thread selection unit then looks at the array to decide which two threads should be selected for instruction fetch at each cycle.

[Figure 2. Software architecture of the detector thread: the status counters are updated; when the quantum IPC falls below the threshold, a new policy is determined and switched in and the clogging threads are identified; Policy Enforce then drives the TSU.]

4.1 Pseudo Code of the Detector Thread

The pseudo code of the detector thread is shown in Figure 3. The main subroutine Detector_Thread() has a large endless while loop with a jump location, East, right ahead of it. If the condition IPC_last < IPC_thold holds true, it is recognized as a low-throughput event and the required actions are taken. IPC_last is the number of committed instructions per cycle during the last eight-kilocycle quantum and IPC_thold is the IPC threshold value, which is predetermined by the developer of the detector thread management kernel. This threshold value may also be updated by the detector thread software. Once a low-throughput condition is recognized, Identify_CloggingThreads() is called and the cause of the low throughput is analyzed to identify the clogging threads. Determine_NewPolicy() is called next to find the policy that should be engaged in the next quantum. This stage needs the most effort, since choosing a new policy will significantly affect the throughput of the next scheduling interval. The new policy is then engaged as the next incumbent policy by the function Policy_Switch() and a jump to the subroutine Policy_Enforce() is made. In this routine, the thread priority array (TPA) is updated depending on the current system state and the incumbent policy, while the thread selection unit (TSU) examines this array to determine the threads for instruction fetch at each cycle. The TSU selects up to two threads at each cycle because we are using ICOUNT.2.8 [20].

[Figure 3. The framework of a detector thread in pseudo code (abridged).]

4.2 Determination of Threshold Values

The big question to address before determining the next fetch policy is how we know whether the processor is experiencing low throughput. What is the threshold reference against which we can make accurate judgments? Figure 7 illustrates how the value of the IPC threshold affects the frequency of switchings and the quality of a switch (the quality of a switch is high when the switch results in an increase of throughput in the next scheduling interval). If the threshold value is too low, very little switching takes place while the quality of a switch can often be high: when low throughput is detected against a very low threshold, the incumbent policy is unlikely to be capable of improving the situation. If the value is too high, switching occurs too frequently.
Further, the quality of each switch can be very low, since it is more likely that the situation cannot improve even with alternative policies, the current throughput already being fairly high.

4.3 Determination of Next Fetch Policy

Underlying Premises

Once it turns out that the incumbent fetch policy fails to sustain high throughput, the following may be taken into consideration to determine a new fetch policy:

- What was the fetch policy for the last quantum?
- What are the current conditions? (Instruction counts, cache miss rates, etc.)
- Is the IPC increasing or decreasing? (Throughput gradient)

- What has been the history of a fetch policy's effect under a certain condition?

The more factors we take into consideration, the more sophisticated and informed the determination heuristic becomes, but a too-sophisticated heuristic may not fit in the available cycle budget or in the DT PRAM, whose size is also limited. The fewer factors we take into consideration, the lower the overhead of the detector thread and the quicker its response, but limiting the sophistication of the scheduling algorithm may result in weak performance. Thus, we need to find the trade-off where the overhead fits our budget while still producing good results. The simplest way to determine the new policy is a fixed transition with no consideration of current conditions. This is basically what we do in our Type 1 heuristic (Figure 4). It should be noted, however, that switching to another specific policy may worsen an already deteriorating situation instead of improving it, if the newly engaged policy does not happen to address the problems the system is currently experiencing. This kind of approach also relies heavily on the value of the threshold: a higher threshold value is more likely to cause such adverse effects, while a lower value is less so.

Various Heuristics

The first of the heuristics, called Type 1, is the simplest way of determining a new fetch policy. In this scheme, no status indicators are referenced before making a decision, so it is not sensitive to the state the system is currently in. As long as a low-throughput condition is not detected, the current state, that is, the incumbent fetch policy, is maintained. Once a low-throughput condition has been detected, a transition to the other policy (either BRCOUNT or ICOUNT) is unconditionally made. Initially, the default fetch policy is ICOUNT. The advantage of this scheme is that the software overhead of the detector thread is minimal, to the degree that it could be implemented in hardware; however, the detector thread's advantages of flexibility and programmability would then not be available.

[Figure 4. Type 1 heuristic for determination of a new fetch policy: a two-state machine alternating between ICOUNT and BRCOUNT.]

The Type 2 heuristic is another simple way of determining a new fetch policy. As in Type 1, no status indicators are referenced for the decision. The difference (Figure 5) is that one more state (or fetch policy) has been added to the original finite state machine. Variants of this scheme can be made by changing the sequence of the transitions, currently set to the order of ICOUNT, L1MISSCOUNT and BRCOUNT, or by adding more fetch policies to the current set of three.

[Figure 5. Type 2 heuristic for determination of a new fetch policy: a three-state cycle over ICOUNT, L1MISSCOUNT and BRCOUNT.]

Type 1 and Type 2 consider only what the fetch policy was for the last quantum once low throughput is detected. There is only one state that can be transited to from each state; thus, as long as low throughput is not avoided, the two or three states are entered in a cyclic fashion. In the Type 3 heuristic (Figure 6), one of two states can be entered from each state, depending on the value of some specific conditions: the transition is made to the policy that is reckoned to improve throughput under the current condition. The Type 3 heuristic relies on the following conditions.

COND_MEM is true when one of the following two subconditions is true:
1. The L1 miss count for the last quantum is higher than its threshold value of 0.19 misses/cycle.
2. The load/store queue becomes full too often, more often than its threshold value of 0.45 times/cycle.

COND_BR is true when one of the following two subconditions is true:

1. The branch misprediction count for the last quantum is higher than its threshold value of 0.02 mispredictions/cycle.
2. The count of conditional branches for the last quantum is higher than its threshold value of 0.38 branches/cycle.

The specific threshold values above for the L1 miss count, load/store queue occupancy rate, branch count and branch misprediction count were determined by simulation: we ran eight-thread simulations in our SMT simulator with our 13 different mixes of applications and took the average value of each metric. These measures are certainly dependent on the hardware configuration and on the kind of mixes running in the processor; there can be no single golden reference that can always be used. To be more effective, the threshold values should be updated to reflect newly found information. That is one of the reasons why the detector thread approach suits adaptive dynamic thread scheduling: the system's detector thread management kernel can profile the system, determine whether the current threshold values are obsolete and, if so, update them to reflect the new state of the system. This update can be done by writing the values into the detector thread's DT DRAM through DMA [15].
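As an illustration of how cheaply these conditions can be evaluated, the following C sketch encodes COND_MEM and COND_BR with the thresholds quoted above. It is a minimal sketch under our own assumptions: the counter fields mirror the hypothetical thread_status_t from Section 3, the per-quantum rates are scaled to integers to avoid floating point, and the function names are illustrative, not from the paper.

```c
#include <stdint.h>

#define QUANTUM_CYCLES 8192

/* Per-quantum event totals, summed over all threads (hypothetical). */
typedef struct {
    uint32_t l1_misses;     /* L1 cache misses in the last quantum        */
    uint32_t lsq_full;      /* cycles the load/store queue was full       */
    uint32_t br_mispred;    /* branch mispredictions in the last quantum  */
    uint32_t cond_branches; /* conditional branches in the last quantum   */
} quantum_stats_t;

/* Does count/QUANTUM_CYCLES exceed thresh_x100/100?
 * Kept in integer arithmetic for the detector thread. */
static int rate_exceeds(uint32_t count, uint32_t thresh_x100)
{
    return (uint64_t)count * 100 > (uint64_t)thresh_x100 * QUANTUM_CYCLES;
}

/* COND_MEM: L1 misses > 0.19/cycle, or LSQ full > 0.45/cycle. */
static int cond_mem(const quantum_stats_t *s)
{
    return rate_exceeds(s->l1_misses, 19) || rate_exceeds(s->lsq_full, 45);
}

/* COND_BR: mispredictions > 0.02/cycle, or cond. branches > 0.38/cycle. */
static int cond_br(const quantum_stats_t *s)
{
    return rate_exceeds(s->br_mispred, 2) || rate_exceeds(s->cond_branches, 38);
}
```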

[Figure 6. Type 3 heuristic for determination of a new fetch policy: a state machine over ICOUNT, BRCOUNT and L1MISSCOUNT whose transitions are guarded by COND_MEM, COND_BR and their negations.]

The Type 3 heuristic works as follows. Suppose that BRCOUNT is the incumbent fetch policy when low throughput is detected. This implies that BRCOUNT has not worked well during the last quantum and that there is no crucial imbalance in conditional branches among the threads of the current set; the imbalance might lie in other factors. We can then guess that one of the other policies, ICOUNT or L1MISSCOUNT, may work better. We therefore check the value of the condition COND_MEM. If it holds true, the imbalance might have been in the number of L1 cache misses or in the usage of the load/store queue, so the transition is made to L1MISSCOUNT. Otherwise, the problem probably does not lie in memory usage, and the transition is made to ICOUNT, which works best on the average.

For another type of heuristic, Type 4, we add two features (a sketch follows Table 1 below). The first is to take the gradient of the throughput into account: even when low throughput is detected, if the throughput is higher than that observed one quantum earlier (positive gradient), switching policies is not allowed. That way, we wait for the situation to keep improving under the original fetch policy. The second feature is to keep track of the switching history. In the switching history buffer, the following are recorded for each policy switching event:

- Incumbent policy: the fetch policy engaged before the switch takes place.
- Value of the condition: for each policy, one condition is checked; its value is recorded.
- Counter for positive outcomes (poscnt): incremented every time this specific case ended with an increase in throughput.
- Counter for negative outcomes (negcnt): incremented every time this specific case ended with a decrease in throughput.

Before making the final decision, poscnt and negcnt are compared. If poscnt is greater, the regular switch is made; otherwise, the opposite direction is chosen. For instance, suppose the incumbent policy was ICOUNT and low throughput is detected. With COND_BR being true, the transition under the Type 3 heuristic would have been toward BRCOUNT. In Type 4, the counters (poscnt and negcnt) are examined and, if poscnt is not greater than negcnt, the transition is made toward the opposite policy, L1MISSCOUNT.

Table 1. Various Fetch Policies tested

  BRCOUNT       Number of total branches for a thread
  LDCOUNT       Number of total loads for a thread
  MEMCOUNT      Number of total memory accesses for a thread
  L1MISSCOUNT   Number of total L1 cache misses for a thread
  L1IMISSCOUNT  Number of total L1 ICache misses for a thread
  L1DMISSCOUNT  Number of total L1 DCache misses for a thread
  ICOUNT        Current instruction queue population for a thread
  ACCIPC        Accumulated IPC for a thread
  STALLCOUNT    Number of total stalls incurred for a thread
  RR            Round-robin scheduling
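To make the Type 4 bookkeeping concrete, here is a minimal C sketch of the switching history buffer and the poscnt/negcnt override. It rests on assumptions of ours: the policy and condition encodings, the table indexed by (incumbent policy, condition value), and the transitions for an incumbent L1MISSCOUNT (which the paper does not spell out) are hypothetical; only the BRCOUNT and ICOUNT cases follow the worked examples in the text.

```c
#include <stdint.h>

typedef enum { P_ICOUNT, P_BRCOUNT, P_L1MISSCOUNT, N_POLICIES } policy_t;

/* One switching-history entry per (incumbent policy, condition value). */
typedef struct {
    uint32_t poscnt;   /* switches from this case that raised throughput  */
    uint32_t negcnt;   /* switches from this case that lowered throughput */
} history_t;

static history_t hist[N_POLICIES][2];

/* Type 3 transition table: next policy given the incumbent and the value
 * of that policy's associated condition (illustrative encoding; the
 * L1MISSCOUNT row is our guess). */
static const policy_t type3_next[N_POLICIES][2] = {
    /* incumbent ICOUNT,      checks COND_BR  */ { P_L1MISSCOUNT, P_BRCOUNT },
    /* incumbent BRCOUNT,     checks COND_MEM */ { P_ICOUNT, P_L1MISSCOUNT },
    /* incumbent L1MISSCOUNT, checks COND_BR  */ { P_ICOUNT, P_BRCOUNT },
};

/* Type 4: follow the Type 3 transition only if history says it has paid
 * off more often than not; otherwise take the opposite branch. */
static policy_t type4_next(policy_t incumbent, int cond)
{
    const history_t *h = &hist[incumbent][cond];
    int take_regular = h->poscnt > h->negcnt;
    return type3_next[incumbent][take_regular ? cond : !cond];
}

/* Called one quantum after a switch, once its outcome is known. */
static void record_outcome(policy_t incumbent, int cond, int ipc_went_up)
{
    if (ipc_went_up) hist[incumbent][cond].poscnt++;
    else             hist[incumbent][cond].negcnt++;
}
```

With incumbent ICOUNT and COND_BR true, type4_next() returns BRCOUNT when poscnt > negcnt and L1MISSCOUNT otherwise, matching the example in the text.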
5. Methodology

We used the SimpleSMT simulator [9], an extension of the SimpleScalar tool set [1]; it thus inherits most architectural specifications of the superscalar model in SimpleScalar. The main architectural difference is that SimpleSMT has separate integer and floating-point instruction queues and more pipeline stages, to reflect the additional complexity of SMT. The simulation environment has been configured with resources compatible with previous research on SMT [20] (for verification purposes), as in our previous work [15]. We used SPEC CPU2000 [7] as our simulation workload and formed thirteen program mixtures based on each program's properties: IPC on a single-threaded machine model, memory footprint, and whether the application requires floating-point operations. For combinations with a mix of integer and floating-point applications, we attempted to make the mix as even as possible. For simulation of the 4- and 6-thread cases, some applications were randomly chosen to be excluded from the 8-thread mixes.

We modeled ten different fetch policies, as shown in Table 1. BRCOUNT, L1DMISSCOUNT, ICOUNT and RR were proposed and evaluated in [20]. Additionally, we included LDCOUNT, MEMCOUNT, ACCIPC and STALLCOUNT in our list, and L1MISSCOUNT and L1IMISSCOUNT were added to take a closer look at the effect of the caches. The description of each policy is found in the table. At each cycle, the simulator sorts the threads according to the fetch policy. Instructions are fetched from the first thread as long as a cache block boundary is not met; if no boundary is encountered, all eight instructions are fetched from that one thread, and otherwise instructions can be fetched from the next thread. We limited the number of threads that can be fetched in one cycle to two, since a study [2] showed that fetching all eight instructions from one thread can adversely affect performance due to fetch fragmentation. For fair comparison, we applied the same fetch mechanism, sketched below, to both fixed scheduling and adaptive scheduling.
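The per-cycle fetch partitioning just described might look like the following C sketch. It is an illustration under our own assumptions: sort_by_policy() stands in for whichever fetch policy is active, and the cache-line geometry is a parameter rather than a value fixed by the paper.

```c
#include <stdint.h>

#define FETCH_WIDTH   8   /* at most 8 instructions fetched per cycle     */
#define MAX_FETCH_THR 2   /* at most 2 threads may fetch in one cycle     */
#define LINE_INSNS    8   /* instructions per cache block (assumed)       */

/* Hypothetical hook: orders thread IDs by the active fetch policy
 * (ICOUNT, BRCOUNT, ...), highest priority first; returns the count. */
extern int sort_by_policy(int *order, int nthreads);

/* Hypothetical per-thread fetch PCs, in instruction-word units. */
extern uint64_t pc_of[];

/* Fetch up to FETCH_WIDTH instructions from at most MAX_FETCH_THR
 * threads; each thread fetches only up to its cache block boundary. */
static int fetch_one_cycle(int nthreads)
{
    int order[64];
    int n = sort_by_policy(order, nthreads);
    int fetched = 0, threads_used = 0;

    for (int i = 0; i < n && threads_used < MAX_FETCH_THR
                         && fetched < FETCH_WIDTH; i++) {
        int t = order[i];
        /* Instructions left in this thread's current cache block. */
        int to_boundary = LINE_INSNS - (int)(pc_of[t] % LINE_INSNS);
        int take = to_boundary;
        if (take > FETCH_WIDTH - fetched)
            take = FETCH_WIDTH - fetched;
        pc_of[t] += (uint64_t)take;   /* consume the fetched slots */
        fetched  += take;
        threads_used++;
    }
    return fetched;
}
```

If the highest-priority thread sits at the start of a cache block, it supplies all eight instructions and no second thread fetches that cycle; otherwise the remainder of the budget goes to the next thread in priority order.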

Because of the huge size of the SPEC 2000 applications, it is almost impossible to run simulations to the end of all programs. Since the reference mode of a typical SPEC 2000 application averages 200 billion instructions, and the performance of our simulator is about 25K instructions per second, it would take about three months to run a single application to completion. To lower the time requirement and still obtain accurate simulation results, we ran simulations of a million cycles each in ten randomly chosen intervals, taking advantage of the fast-forward feature of the SimpleScalar simulator [1].

6. Experimental Results

Figures 7 a) and c) verify what we had surmised in Section 4.2. As the threshold value increases, more switchings occur for all types of heuristics. The quality of a switch decreases as the threshold value increases, but not as fast as the number of switchings increases. Note that with Type 1 and Type 2 this is not the case; the quality of a switch may be higher with a threshold value of 3 than with 2. Figures 7 b) and d) show how the policy determination heuristic type affects the frequency and quality of switchings. Type 3' represents the Type 3 heuristic plus consideration of the throughput gradient. It is interesting to note that the Type 4 heuristic results in more low-quality (malignant) switchings. This implies that determining a new fetch policy based on historical performance is not effective.

[Figure 7. Effect of the threshold value on switch occurrence and quality. Panels: (a) number of switchings vs. threshold value, (b) number of switchings vs. heuristic type, (c) probability of benign switches vs. threshold value, (d) probability of benign switches vs. heuristic type, with curves for Type 1, Type 2, Type 3, Type 3' and Type 4.]
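For reference, the switch-quality statistic plotted in Figure 7 c) and d) can be tallied with a few counters. The sketch below is our own rendering of the metric as defined in Section 4.2 (a switch is benign when throughput rises in the next scheduling interval); the names are illustrative.

```c
/* Tally of policy-switch outcomes over a run. */
typedef struct {
    unsigned switches;  /* total policy switches                      */
    unsigned benign;    /* switches followed by higher quantum IPC    */
} switch_quality_t;

/* Record one switch once the following quantum's IPC is known. */
static void account_switch(switch_quality_t *q,
                           double ipc_before, double ipc_after)
{
    q->switches++;
    if (ipc_after > ipc_before)
        q->benign++;
}

/* Probability of a benign switch, as plotted in Figure 7 c) and d). */
static double benign_probability(const switch_quality_t *q)
{
    return q->switches ? (double)q->benign / q->switches : 0.0;
}
```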
Figure 8 shows the effect of the IPC threshold value on throughput. The best performance is reached when the threshold value is 2 and the Type 3 heuristic is used; the maximum performance improvement over ICOUNT is about 30%. The values in the graphs are averaged over all the various mixtures. We also found that greater improvements can be achieved when more similar applications appear in a mixture; with a mixture of various applications, less improvement was achieved.

[Figure 8. Effect of the threshold value and policy determination heuristic on throughput (average of all mixtures). Panels: (a) IPC vs. threshold value, (b) IPC vs. heuristic type, (c) aggregate IPC vs. threshold value for each type, (d) aggregate IPC vs. type for each threshold value (m=1 through m=5).]

It should also be noted that the throughput values observed in this experiment are relatively low considering the number of threads involved. The reason lies in the configuration we chose for simulation. SimpleSMT, like SimpleScalar, does not simulate system calls; instead, it translates them into host system calls for efficient simulation, sacrificing accuracy. We assumed that when a thread encounters a system call, all threads have to be flushed from the pipeline before the system call can start, which is the most conservative assumption. In real situations, if the system call performs critical operations such as memory allocation, that will indeed be the case, because such operations may affect all threads resident in the processor.

7. Summary and Conclusion

This paper has investigated how much improvement can be obtained by allowing an adaptive dynamic thread scheduling approach rather than the fixed scheduling approaches employed in earlier work. It proposed the detector thread approach to implement adaptive scheduling with low hardware and software overhead. The detector thread is a special thread that occupies one designated thread context with minimal extra hardware; it is scheduled for execution when idle slots are available. To validate the idea, we used the SimpleSMT simulator to derive an upper bound on the performance improvement we can hope to achieve using our approach. SPEC 2000 applications were used to create thirteen different mixes of applications based on single-application performance, memory footprint and type (integer or floating-point). Simulation results showed that there still is significant room (27%) for performance improvement over fixed scheduling for eight threads, room on which adaptive scheduling can capitalize. This paper stresses that adaptive scheduling is feasible because our platform is SMT, where it is possible to keep one thread resident in the processor with minimal overhead.

The results we obtained in this study are greatly encouraging. Since SMT was introduced, studies have shown that having too many threads (usually more than four or five) does not return the expected throughput increase and sometimes even lowers throughput. Our study has shown that adaptive thread scheduling in combination with a detector thread can significantly extend the saturation point in terms of the number of threads, provided that the detector thread is programmed with effective low-throughput detection and fetch policy selection algorithms. The software architecture for the detector thread was developed and various heuristics were evaluated for determining the fetch policy to be used for the next scheduling quantum. Type 3 turned out to work best, with a threshold value of 2. Type 4, which keeps track of the outcomes of earlier decisions, turned out not to be worth the effort: the fetch policies showed no correlation in the time domain, because there is no fixed pattern in the interactions between independent threads. Once the job scheduler is put into the picture, the set of applications will change even more dynamically, and correlation in the time domain will be even harder to find. We also found that with a mixture of various applications, less improvement was achieved with ADTS over the fixed ICOUNT scheduling. That is because a good mixture of applications already maintains high utilization of the various resources available in the processor. Consequently, we may ask the following question: why not just let the job scheduler concentrate on co-scheduling well-balanced sets of applications? Then ICOUNT would work well, and not much improvement could be made over it with adaptive scheduling. Our answer to the question is no, for two reasons. The first is that the job scheduler cannot co-schedule well-balanced sets of applications all the time, especially when the number of jobs available in the system is not significantly larger than the number of hardware contexts of the SMT processor. The second is that, without the detector thread, the job scheduler would have to stay on the processor for a significantly longer duration.

References

[1] T. Austin. The SimpleScalar Architectural Research Tool Set, Version 2.0. Technical Report 1342, University of Wisconsin-Madison, June.
[2] J. Burns and J.-L. Gaudiot. Exploring the SMT Fetch Bottleneck. In Proceedings of the Workshop on Multithreaded Execution, Architecture and Compilation (MTEAC99), Orlando, Florida, January.
[3] R. Chappell, J. Stark, S. Kim, S. Reinhardt, and Y. Patt. Simultaneous Subordinate Microthreading (SSMT). In Proceedings of the 26th Annual International Symposium on Computer Architecture, May.
[4] S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, and D. Tullsen. Simultaneous Multithreading: A Platform for Next-Generation Processors. IEEE Micro, pages 12-18, September/October.
[5] M. Gulati and N. Bagherzadeh. Performance Study of a Multithreaded Superscalar Microprocessor. In Proceedings of the 2nd International Symposium on High Performance Computer Architecture, February.
[6] L. Gwennap. DanSoft Develops VLIW Design. Microprocessor Report, 11(2):18-22, February.
[7] J. Henning. SPEC CPU2000: Measuring CPU Performance in the New Millennium. IEEE Computer, 33(7):28-35, July.
[8] H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa. An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads. In Proceedings of the 19th Annual International Symposium on Computer Architecture, May.
[9] S. Lee and J.-L. Gaudiot. ALPSS: Architectural Level Power Simulator for Simultaneous Multithreading, Version 1.0. Technical Report TR-02-04, University of Southern California, April.
[10] J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen. Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading. ACM Transactions on Computer Systems, August.
[11] C. Luk. Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 40-51, June.
[12] M. McCormick, J. Ledlie, and O. Zaki. Adaptively Scheduling Processes on a Simultaneous Multithreading Processor. Technical report, University of Wisconsin-Madison.
[13] S. Parekh, S. Eggers, H. Levy, and J. Lo. Thread-Sensitive Scheduling for SMT Processors. Technical report, University of Washington.
[14] A. Roth and G. Sohi. Speculative Data-Driven Multithreading. In Proceedings of the 7th International Symposium on High Performance Computer Architecture, pages 37-48, Monterrey, Mexico, January.
[15] C. Shin, S. Lee, and J.-L. Gaudiot. The Need for Adaptive Dynamic Thread Scheduling in Simultaneous Multithreading. In Proceedings of the 1st Workshop on Hardware/Software Support for Parallel and Distributed Scientific and Engineering Computing (SPDSEC-02), held in conjunction with the 11th International Conference on Parallel Architectures and Compilation Techniques (PACT-02), September.
[16] U. Sigmund and T. Ungerer. Evaluating a Multithreaded Superscalar Microprocessor versus a Multiprocessor Chip. In Proceedings of the 4th PASA Workshop on Parallel Systems and Algorithms, April.
[17] A. Snavely and D. Tullsen. Symbiotic Jobscheduling for a Simultaneous Multithreading Architecture. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, Massachusetts, November.
[18] Y. Song and M. Dubois. Assisted Execution. Technical Report CENG 98-25, University of Southern California.
[19] G. E. Suh, S. Devadas, and L. Rudolph. A New Memory Monitoring Scheme for Memory-Aware Scheduling. In Proceedings of the High Performance Computer Architecture (HPCA'02) Conference, February.
[20] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May.
[21] D. Tullsen, S. Eggers, and H. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, June.
[22] H. Wang, P. Wang, R. Weldon, et al. Speculative Precomputation: Exploring the Use of Multithreading for Latency Tolerance. Intel Technology Journal, 6(1), February 2002.


Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern: Pipelining HW Q. Can a MIPS SW instruction executing in a simple 5-stage pipelined implementation have a data dependency hazard of any type resulting in a nop bubble? If so, show an example; if not, prove

More information

<Insert Picture Here> An Experimental Model to Analyze OpenMP Applications for System Utilization

<Insert Picture Here> An Experimental Model to Analyze OpenMP Applications for System Utilization An Experimental Model to Analyze OpenMP Applications for System Utilization Mark Woodyard Principal Software Engineer 1 The following is an overview of a research project. It is intended

More information

The Truth Behind IBM AIX LPAR Performance

The Truth Behind IBM AIX LPAR Performance The Truth Behind IBM AIX LPAR Performance Yann Guernion, VP Technology EMEA HEADQUARTERS AMERICAS HEADQUARTERS Tour Franklin 92042 Paris La Défense Cedex France +33 [0] 1 47 73 12 12 info@orsyp.com www.orsyp.com

More information

A Predictive Model for Cache-Based Side Channels in Multicore and Multithreaded Microprocessors

A Predictive Model for Cache-Based Side Channels in Multicore and Multithreaded Microprocessors A Predictive Model for Cache-Based Side Channels in Multicore and Multithreaded Microprocessors Leonid Domnitser, Nael Abu-Ghazaleh and Dmitry Ponomarev Department of Computer Science SUNY-Binghamton {lenny,

More information

Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking

Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking Kathlene Hurt and Eugene John Department of Electrical and Computer Engineering University of Texas at San Antonio

More information

Operatin g Systems: Internals and Design Principle s. Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings

Operatin g Systems: Internals and Design Principle s. Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings Operatin g Systems: Internals and Design Principle s Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Bear in mind,

More information

2. is the number of processes that are completed per time unit. A) CPU utilization B) Response time C) Turnaround time D) Throughput

2. is the number of processes that are completed per time unit. A) CPU utilization B) Response time C) Turnaround time D) Throughput Import Settings: Base Settings: Brownstone Default Highest Answer Letter: D Multiple Keywords in Same Paragraph: No Chapter: Chapter 5 Multiple Choice 1. Which of the following is true of cooperative scheduling?

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Department of Electrical Engineering Princeton University {carolewu, mrm}@princeton.edu Abstract

More information

MAGENTO HOSTING Progressive Server Performance Improvements

MAGENTO HOSTING Progressive Server Performance Improvements MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 sales@simplehelix.com 1.866.963.0424 www.simplehelix.com 2 Table of Contents

More information

Testing Database Performance with HelperCore on Multi-Core Processors

Testing Database Performance with HelperCore on Multi-Core Processors Project Report on Testing Database Performance with HelperCore on Multi-Core Processors Submitted by Mayuresh P. Kunjir M.E. (CSA) Mahesh R. Bale M.E. (CSA) Under Guidance of Dr. T. Matthew Jacob Problem

More information

The Importance of Software License Server Monitoring

The Importance of Software License Server Monitoring The Importance of Software License Server Monitoring NetworkComputer How Shorter Running Jobs Can Help In Optimizing Your Resource Utilization White Paper Introduction Semiconductor companies typically

More information

Validating Java for Safety-Critical Applications

Validating Java for Safety-Critical Applications Validating Java for Safety-Critical Applications Jean-Marie Dautelle * Raytheon Company, Marlborough, MA, 01752 With the real-time extensions, Java can now be used for safety critical systems. It is therefore

More information

Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification

Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification Introduction Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification Advanced Topics in Software Engineering 1 Concurrent Programs Characterized by

More information

Categories and Subject Descriptors C.1.1 [Processor Architecture]: Single Data Stream Architectures. General Terms Performance, Design.

Categories and Subject Descriptors C.1.1 [Processor Architecture]: Single Data Stream Architectures. General Terms Performance, Design. Enhancing Memory Level Parallelism via Recovery-Free Value Prediction Huiyang Zhou Thomas M. Conte Department of Electrical and Computer Engineering North Carolina State University 1-919-513-2014 {hzhou,

More information

CPU Scheduling. Basic Concepts. Basic Concepts (2) Basic Concepts Scheduling Criteria Scheduling Algorithms Batch systems Interactive systems

CPU Scheduling. Basic Concepts. Basic Concepts (2) Basic Concepts Scheduling Criteria Scheduling Algorithms Batch systems Interactive systems Basic Concepts Scheduling Criteria Scheduling Algorithms Batch systems Interactive systems Based on original slides by Silberschatz, Galvin and Gagne 1 Basic Concepts CPU I/O Burst Cycle Process execution

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

An Oracle White Paper July 2012. Load Balancing in Oracle Tuxedo ATMI Applications

An Oracle White Paper July 2012. Load Balancing in Oracle Tuxedo ATMI Applications An Oracle White Paper July 2012 Load Balancing in Oracle Tuxedo ATMI Applications Introduction... 2 Tuxedo Routing... 2 How Requests Are Routed... 2 Goal of Load Balancing... 3 Where Load Balancing Takes

More information

Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield.

Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield. Technical Report UCAM-CL-TR-707 ISSN 1476-2986 Number 707 Computer Laboratory Complexity-effective superscalar embedded processors using instruction-level distributed processing Ian Caulfield December

More information

ICS 143 - Principles of Operating Systems

ICS 143 - Principles of Operating Systems ICS 143 - Principles of Operating Systems Lecture 5 - CPU Scheduling Prof. Nalini Venkatasubramanian nalini@ics.uci.edu Note that some slides are adapted from course text slides 2008 Silberschatz. Some

More information

Operating Systems 4 th Class

Operating Systems 4 th Class Operating Systems 4 th Class Lecture 1 Operating Systems Operating systems are essential part of any computer system. Therefore, a course in operating systems is an essential part of any computer science

More information

DACOTA: Post-silicon Validation of the Memory Subsystem in Multi-core Designs. Presenter: Bo Zhang Yulin Shi

DACOTA: Post-silicon Validation of the Memory Subsystem in Multi-core Designs. Presenter: Bo Zhang Yulin Shi DACOTA: Post-silicon Validation of the Memory Subsystem in Multi-core Designs Presenter: Bo Zhang Yulin Shi Outline Motivation & Goal Solution - DACOTA overview Technical Insights Experimental Evaluation

More information

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department

More information

Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier.

Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier. Parallel Computing 37 (2011) 26 41 Contents lists available at ScienceDirect Parallel Computing journal homepage: www.elsevier.com/locate/parco Architectural support for thread communications in multi-core

More information

Capacity Estimation for Linux Workloads

Capacity Estimation for Linux Workloads Capacity Estimation for Linux Workloads Session L985 David Boyes Sine Nomine Associates 1 Agenda General Capacity Planning Issues Virtual Machine History and Value Unique Capacity Issues in Virtual Machines

More information

BridgeWays Management Pack for VMware ESX

BridgeWays Management Pack for VMware ESX Bridgeways White Paper: Management Pack for VMware ESX BridgeWays Management Pack for VMware ESX Ensuring smooth virtual operations while maximizing your ROI. Published: July 2009 For the latest information,

More information

Two-Stage Forking for SIP-based VoIP Services

Two-Stage Forking for SIP-based VoIP Services Two-Stage Forking for SIP-based VoIP Services Tsan-Pin Wang National Taichung University An-Chi Chen Providence University Li-Hsing Yen National University of Kaohsiung Abstract SIP (Session Initiation

More information

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

Precise and Accurate Processor Simulation

Precise and Accurate Processor Simulation Precise and Accurate Processor Simulation Harold Cain, Kevin Lepak, Brandon Schwartz, and Mikko H. Lipasti University of Wisconsin Madison http://www.ece.wisc.edu/~pharm Performance Modeling Analytical

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Course on: Advanced Computer Architectures INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Prof. Cristina Silvano Politecnico di Milano cristina.silvano@polimi.it Prof. Silvano, Politecnico di Milano

More information

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1 Monitors Monitor: A tool used to observe the activities on a system. Usage: A system programmer may use a monitor to improve software performance. Find frequently used segments of the software. A systems

More information

Putting it all together: Intel Nehalem. http://www.realworldtech.com/page.cfm?articleid=rwt040208182719

Putting it all together: Intel Nehalem. http://www.realworldtech.com/page.cfm?articleid=rwt040208182719 Putting it all together: Intel Nehalem http://www.realworldtech.com/page.cfm?articleid=rwt040208182719 Intel Nehalem Review entire term by looking at most recent microprocessor from Intel Nehalem is code

More information

Quality of Service versus Fairness. Inelastic Applications. QoS Analogy: Surface Mail. How to Provide QoS?

Quality of Service versus Fairness. Inelastic Applications. QoS Analogy: Surface Mail. How to Provide QoS? 18-345: Introduction to Telecommunication Networks Lectures 20: Quality of Service Peter Steenkiste Spring 2015 www.cs.cmu.edu/~prs/nets-ece Overview What is QoS? Queuing discipline and scheduling Traffic

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

Eight Ways to Increase GPIB System Performance

Eight Ways to Increase GPIB System Performance Application Note 133 Eight Ways to Increase GPIB System Performance Amar Patel Introduction When building an automated measurement system, you can never have too much performance. Increasing performance

More information

OPERATING SYSTEM - VIRTUAL MEMORY

OPERATING SYSTEM - VIRTUAL MEMORY OPERATING SYSTEM - VIRTUAL MEMORY http://www.tutorialspoint.com/operating_system/os_virtual_memory.htm Copyright tutorialspoint.com A computer can address more memory than the amount physically installed

More information

Performance Impacts of Non-blocking Caches in Out-of-order Processors

Performance Impacts of Non-blocking Caches in Out-of-order Processors Performance Impacts of Non-blocking Caches in Out-of-order Processors Sheng Li; Ke Chen; Jay B. Brockman; Norman P. Jouppi HP Laboratories HPL-2011-65 Keyword(s): Non-blocking cache; MSHR; Out-of-order

More information

CPU Scheduling. CPU Scheduling

CPU Scheduling. CPU Scheduling CPU Scheduling Electrical and Computer Engineering Stephen Kim (dskim@iupui.edu) ECE/IUPUI RTOS & APPS 1 CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling

More information

2

2 1 2 3 4 5 For Description of these Features see http://download.intel.com/products/processor/corei7/prod_brief.pdf The following Features Greatly affect Performance Monitoring The New Performance Monitoring

More information

Intel DPDK Boosts Server Appliance Performance White Paper

Intel DPDK Boosts Server Appliance Performance White Paper Intel DPDK Boosts Server Appliance Performance Intel DPDK Boosts Server Appliance Performance Introduction As network speeds increase to 40G and above, both in the enterprise and data center, the bottlenecks

More information

Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

More information

Resource Allocation Schemes for Gang Scheduling

Resource Allocation Schemes for Gang Scheduling Resource Allocation Schemes for Gang Scheduling B. B. Zhou School of Computing and Mathematics Deakin University Geelong, VIC 327, Australia D. Walsh R. P. Brent Department of Computer Science Australian

More information

Scheduling Algorithms for Effective Thread Pairing on Hybrid Multiprocessors

Scheduling Algorithms for Effective Thread Pairing on Hybrid Multiprocessors Scheduling Algorithms for Effective Thread Pairing on Hybrid Multiprocessors Robert L. McGregor Christos D. Antonopoulos Department of Computer Science The College of William & Mary Williamsburg, VA 23187-8795

More information

FLIX: Fast Relief for Performance-Hungry Embedded Applications

FLIX: Fast Relief for Performance-Hungry Embedded Applications FLIX: Fast Relief for Performance-Hungry Embedded Applications Tensilica Inc. February 25 25 Tensilica, Inc. 25 Tensilica, Inc. ii Contents FLIX: Fast Relief for Performance-Hungry Embedded Applications...

More information

OpenFlow Based Load Balancing

OpenFlow Based Load Balancing OpenFlow Based Load Balancing Hardeep Uppal and Dane Brandon University of Washington CSE561: Networking Project Report Abstract: In today s high-traffic internet, it is often desirable to have multiple

More information

Technical Properties. Mobile Operating Systems. Overview Concepts of Mobile. Functions Processes. Lecture 11. Memory Management.

Technical Properties. Mobile Operating Systems. Overview Concepts of Mobile. Functions Processes. Lecture 11. Memory Management. Overview Concepts of Mobile Operating Systems Lecture 11 Concepts of Mobile Operating Systems Mobile Business I (WS 2007/08) Prof Dr Kai Rannenberg Chair of Mobile Business and Multilateral Security Johann

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides

More information

Multi-core and Linux* Kernel

Multi-core and Linux* Kernel Multi-core and Linux* Kernel Suresh Siddha Intel Open Source Technology Center Abstract Semiconductor technological advances in the recent years have led to the inclusion of multiple CPU execution cores

More information

Rackspace Cloud Databases and Container-based Virtualization

Rackspace Cloud Databases and Container-based Virtualization Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many

More information

HP Smart Array Controllers and basic RAID performance factors

HP Smart Array Controllers and basic RAID performance factors Technical white paper HP Smart Array Controllers and basic RAID performance factors Technology brief Table of contents Abstract 2 Benefits of drive arrays 2 Factors that affect performance 2 HP Smart Array

More information

Scheduling. Yücel Saygın. These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum

Scheduling. Yücel Saygın. These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum Scheduling Yücel Saygın These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum 1 Scheduling Introduction to Scheduling (1) Bursts of CPU usage alternate with periods

More information

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX Overview CISC Developments Over Twenty Years Classic CISC design: Digital VAX VAXÕs RISC successor: PRISM/Alpha IntelÕs ubiquitous 80x86 architecture Ð 8086 through the Pentium Pro (P6) RJS 2/3/97 Philosophy

More information

A Review of Customized Dynamic Load Balancing for a Network of Workstations

A Review of Customized Dynamic Load Balancing for a Network of Workstations A Review of Customized Dynamic Load Balancing for a Network of Workstations Taken from work done by: Mohammed Javeed Zaki, Wei Li, Srinivasan Parthasarathy Computer Science Department, University of Rochester

More information