Dynamic Scheduling Issues in SMT Architectures


Chulho Shin, System Design Technology Laboratory, Samsung Electronics Corporation
Seong-Won Lee, Dept. of Electrical Engineering - Systems, University of Southern California, seongwon@usc.edu
Jean-Luc Gaudiot, Dept. of Electrical Engineering and Computer Science, University of California, Irvine, gaudiot@uci.edu

Abstract

Simultaneous Multithreading (SMT) attempts to attain higher processor utilization by allowing instructions from multiple independent threads to coexist in a processor and compete for shared resources. Previous studies have shown, however, that its throughput may be limited by the number of threads. One reason is that a fixed thread scheduling policy cannot be optimal for the varying mixes of threads an SMT processor may face. Our Adaptive Dynamic Thread Scheduling (ADTS) was previously proposed to achieve higher utilization by allowing a detector thread to make use of wasted pipeline slots at nominal hardware and software cost. The detector thread adaptively switches between various fetch policies. Our previous study showed that a single fixed thread scheduling policy leaves much room (some 30%) for improvement compared to an oracle-scheduled case. In this paper, we take a closer look at ADTS. We implemented the functional model of ADTS and its software architecture to evaluate various heuristics for determining a better fetch policy for the next scheduling quantum. We report that performance could be improved by as much as 25%.

(The material reported in this paper is based upon work supported in part by the National Science Foundation under Grants No. CSA and INT. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.)

1. Introduction

Simultaneous Multithreading (SMT) or Multithreaded Superscalar Architectures [4, 10, 21, 20, 5, 8] can achieve high processor utilization by allowing multiple independent threads to coexist in the processor pipeline and share resources with the support of multiple hardware contexts. SMT is an attempt to overcome the low resource utilization of wide-issue single-threaded superscalar processors by exploiting Thread-Level Parallelism (TLP) at a relatively low hardware cost for supporting the multiple hardware contexts. Studies by Tullsen et al. and Ungerer et al. [21, 16] have shown that when the number of threads simultaneously active in an SMT processor becomes greater than four, performance often saturates and in some cases even degrades. In these studies, an attempt was made to overcome the saturation effect by finding a better fetch mechanism or by increasing the number and availability of resources that would otherwise become bottlenecks (such as register files and instruction queues). It was also shown that increasing the size of the caches can result in a higher saturation point. Unfortunately, such remedies do not work in all cases because their effectiveness is heavily affected by the properties of the application mixtures. We believe that one fixed thread scheduling policy which performs better than others on the average cannot deliver the performance we anticipate in SMT processors with more than four thread contexts. We will show that with our adaptive dynamic thread scheduling policy [15], we can significantly improve the performance of SMT processors and prevent the saturation and degradation effects alluded to earlier.
Our work focuses on multiprogrammed or multi-user environments, where the combinations of threads that an SMT processor faces vary significantly over time. For multiprogramming or multi-user workloads consisting of threads running on the processor independently of one another, no information about any interactive behavior between threads may be known in advance. Consequently, a more intelligent and more dynamic thread scheduling capability is indispensable if we are to sustain high throughput. When parallelizing an application to generate multiple threads, the role of thread scheduling is to eliminate resource conflicts and avoid data dependencies in order to expose more parallelism. In contrast, the role of scheduling multiple independent threads (of multiprogrammed workloads) is to perform better traffic control so as to sustain higher throughput by maintaining low interference between threads. Tullsen et al. [20] evaluated several fetch policies and showed that the ICOUNT policy yields the best average performance. ICOUNT gives priority to the threads with fewer instructions in the decode stage, the rename stage, and the instruction queues. Indeed, ICOUNT best accounts for what is taking place in SMT pipelines in general: since it gives priority to the threads that have fewer instructions in the earlier stages of the pipeline, the instruction window is used in a balanced way, and since it gives more opportunities to the threads whose instructions drain through the pipeline more rapidly, the pipeline is used more efficiently. While ICOUNT is the scheduling policy that works best on the average, it does not address problems as directly as other policies such as BRCOUNT and MISSCOUNT do. (BRCOUNT prioritizes threads with fewer conditional branches; see Section 5 for definitions of the various fetch policies.) Assume for example that the set of applications

in an SMT processor consists of four control-intensive applications (with many conditional branches) and four other applications. Further assume that these four control-intensive applications are experiencing many branch mispredictions at the moment. The processor will then suffer from wasted slots filled with wrong-path instructions of the four control-intensive applications, while the other four threads are prevented from exploiting the resources in the pipeline. In this specific case, if BRCOUNT had been used, the four control-intensive threads would have found fewer chances to get fetched. Consequently, the number of fetched instructions from the control-intensive threads would diminish while the number of instructions from the other four threads would increase, evening out the number of effective instructions among all threads.

The main goal of a hardware thread scheduler is to avoid imbalance among threads, where imbalance on a resource means that the usage or counts of the resource are not even among the threads. For example, if one thread has many more instructions in the early stages of the pipeline (the decode and rename stages and the instruction queue) than the others do, we have an imbalance in terms of instruction count. Imbalance adversely affects throughput, resulting in lowered Thread-Level Parallelism, for the following reasons: since a small number of threads occupy one type of resource, the other threads cannot have access to those same resources; the average number of non-dependent, issuable instructions per thread then becomes lower for the other threads, lowering the average number of instructions that can proceed through the pipeline.

With adaptive dynamic thread scheduling, when a change in the system environment is detected, the fetch policy which should be used during the next interval is decided upon and put into effect to eliminate the problematic imbalance. However, having multiple fetch policies and decision-making algorithms in hardware could translate into high hardware complexity. In our previous work [15], we proposed our detector thread approach, which helps lower the hardware requirements and makes use of unused pipeline slots to run the decision-making algorithms and fetch policies. Our approach also has the advantage that thread scheduling can be manipulated even after the chip has been produced, because the detector thread is programmable. The detector thread can also help lower the overhead of the system job scheduler by shortening its stay in the processor and by analyzing information before the job scheduler needs it.

In this paper, we take a closer look at the software aspect of ADTS. We propose an effective software architecture for the detector thread. The core of this software is the set of heuristics for determining the fetch policy to be used in the next scheduling quantum. We implement and evaluate functional models of those heuristics.

This paper is organized as follows. In section 2, previous work related to ours is summarized. The adaptive dynamic thread scheduling is reviewed in section 3, its software architecture is discussed in section 4, and how we evaluate our idea is discussed in section 5. Results of our simulation experiments are presented and analyzed in section 6. Summary and conclusions appear in section 7.

2. Related Work

Wang et al. investigated the use of a special thread aimed at realizing speculative precomputation in one of the two threads available on the Hyper-Threading architecture [22].
The study is targeted at improving the performance of single-threaded applications on two-context SMT processors. DanSoft [6] proposed the idea of nanothreads, in which one nanothread is given control of the processor upon the stall of a main thread. The idea was based on a CMP with dual VLIW single-threaded cores, and its success hinges on the effectiveness of the compiler. Assisted Execution [18] extended the nanothread idea to architectures that allow simultaneous execution of multiple threads, including SMT. It attempts to improve the performance of a main thread by having multiple nanothreads perform prefetching, and its success also hinges on the operation of the compiler. Speculative data-driven multithreading [14] takes advantage of a speculative thread, called a data-driven thread (DDT), to pre-execute critical computations and consume latency on behalf of the main thread on SMT. This study also focused on improving the performance of a main thread. Luk [11] likewise proposed pre-execution for more effective prefetching of hard-to-predict data addresses, using idle threads to boost the performance of a primary thread. Simultaneous Subordinate Microthreading (SSMT) [3] was proposed in an attempt to improve the performance of a single thread by having multiple subordinate microthreads perform useful work such as running sophisticated branch prediction algorithms. The idea was not based on an SMT architecture and also requires effective compiler technology.

Parekh et al. [13] investigated issues related to job scheduling for SMT processors. They compared the performance of oblivious and thread-sensitive scheduling. Oblivious scheduling means round-robin and random scheduling, while thread-sensitive scheduling takes into account the resource demands and behavior of each thread. The study concluded that thread-sensitive IPC-based scheduling can achieve a significant speedup over round-robin methods. However, this study concerns system job scheduling and cannot be directly related to dynamic thread scheduling. Also, the job scheduler has to be brought into the processor, resulting in a context switch of user threads. This job scheduler, however, can take advantage of our detector thread approach, as discussed in Section 3. Another similar study [17] investigated job scheduling for SMT processors. It proposed a job scheduling scheme called SOS, which involves an overhead-free sample phase in which the performance of various schedules (mixes) is sampled and taken into account when selecting tasks for the next time slice. We recognize that this strategy can also benefit from our approach because the detector thread is always active: it could make use of unused pipeline slots and resources to find out which threads should not be selected in the next job scheduling time slice, lowering the burden on the job scheduler. Our adaptive dynamic thread scheduling approach [15] should not be confused with adaptive process scheduling [12], which addresses O/S job scheduling issues for SMT processors: the goal of our approach is to offer more efficient thread scheduling at the individual instruction level in the SMT pipeline.

Suh et al. [19] examined approaches to detecting per-thread cache behavior on SMT using hardware counters and to helping job scheduling based on the information obtained. This approach is similar to our idea of relating the detector thread to job schedulers. However, it does not aim at controlling thread fetch policies.

3. Adaptive Dynamic Thread Scheduling (ADTS) with a Detector Thread (DT)

Our Adaptive Dynamic Thread Scheduling (ADTS) was introduced and discussed in detail in [15], where its implementation with a detector thread (DT) was also discussed. ADTS with a DT tackles two problems: first, a new fetch policy can be activated if the system is suffering from low throughput; second, it allows unused pipeline slots to be used to detect adverse changes in the system, identify threads that clog the pipeline, and take the actions needed to sustain high throughput. The actions that can be taken include context-switching a thread and preventing a specific thread from being fetched. A detector thread is a special thread which reads the thread status indicators and updates the thread control flags based on the current values of the indicators, so that the thread control hardware can take any action necessary to improve the performance of an SMT processor. The per-thread status indicators are updated by circuitry located throughout the processor pipeline, based upon specific events such as cache misses, pipeline stalls, population at each stage, etc.

[Figure 1. How a Detector Thread works with normal threads: per-thread counters for threads A through H and the DT feed the detector thread, which updates the thread control flags read by the thread selection units.]

Our previous work [15] proposed a way to implement the detector thread based on another study [3]. The detector thread has its own program cache, sufficiently large (2 or 4 KB) to fit its small program image, and its data accesses should be mostly to special registers such as the per-thread counters and to general-purpose registers. Most of the time, the detector thread is the lowest-priority thread. When the slots are almost fully occupied by normal threads, the detector thread does not obtain any more scheduling slots; this is acceptable because it means that the processor pipeline slots are enjoying high utilization. Fetching the detector thread's instructions should not result in significant overhead either: since its instructions come from its own isolated program cache, they do not compete for fetch bandwidth with the normal threads. It should not affect data memory bandwidth either, because its data mostly comes from special registers. Also, it was shown that the detector thread's job can fit within the cycle budget allowed in realistic situations [15].

The detector thread plays a major role in this process, as shown in Figure 1. It keeps watching the per-thread status indicators and updates the flags based on its active policy. The indicators are updated by hardware on predetermined events at places spread across the pipeline. The detector thread has the lowest priority among threads, so as long as the pipeline is well utilized, the detector thread is not often activated. Can a detector thread experience starvation in such cases? This depends upon the occupancy rate of the instruction fetch buffer: as long as the instruction fetch buffer is full, no instructions from the detector thread can be fetched. For this detector thread approach to work successfully, it has to be equipped with intelligent heuristics or algorithms to dynamically detect clogging (low throughput) and to choose a better fetch policy for the next time frame.
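To make the interface concrete, the per-thread status indicators described above can be pictured as a small bank of hardware counters exposed to the detector thread as special registers. The following C sketch is only an illustration under assumptions of ours: the field names and the flag encoding are hypothetical stand-ins for the hardware interface, not a layout defined in the paper.

```c
#include <stdint.h>

/* Hypothetical layout of the hardware-maintained per-thread status
 * indicators (one instance per hardware context, updated every cycle). */
typedef struct {
    uint32_t insn_count;      /* instructions in decode/rename/issue queues */
    uint32_t branch_count;    /* conditional branches seen this quantum     */
    uint32_t br_miss_count;   /* branch mispredictions this quantum         */
    uint32_t l1_miss_count;   /* L1 cache misses this quantum               */
    uint32_t lsq_full_count;  /* cycles the load/store queue was full       */
    uint32_t committed;       /* instructions committed this quantum        */
} thread_status_t;

/* Hypothetical per-thread control flags written back by the detector
 * thread and consulted by the thread selection unit each cycle. */
enum {
    FLAG_FETCH_ENABLE   = 1u << 0, /* thread may be fetched next cycle      */
    FLAG_CONTEXT_SWITCH = 1u << 1, /* suspend at the next opportunity       */
};
```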
However, since the resources allowed for the detector thread are quite limited in order to minimize hardware overhead, the algorithm is also limited in the data to which it can refer. This will be the topic of the next section.

4. Software Architecture of the Detector Thread

The role of the detector thread is to check the values of the various thread status indicators and, based on conditions dynamically defined in software, to properly update the thread control flags, as shown in Figure 1. Each thread has its own set of flags. One flag may tell whether a thread can be fetched in the next cycle, while another may tell whether it should be context-switched at the next opportunity. When the system thread is loaded, it will look at the flags and suspend a clogging thread without going through the process of determining which thread to suspend. The thread selection unit then simply issues instructions from threads in their order of priority. Although the per-thread status indicators, thread control flags, and thread selection units are fixed in hardware, we can control the thread scheduling behavior built around those hardware resources by writing different program code for the detector thread.

The software architecture of the detector thread for adaptive thread scheduling is shown in Figure 2. The status counters are updated at each cycle throughout the pipeline. For every period of 8K cycles, the number of committed instructions is counted, as is the maximum number of instructions that could have been executed (8K x 8). If the interval is to remain constant, the maximum need not be counted. The detector thread then checks whether the IPC (the number of committed instructions per cycle) is less than the threshold; if so, the previous time frame is identified as low-throughput. Once a previous scheduling quantum(2) is determined to be low-throughput, a new fetch policy has to be determined, because the incumbent policy (the one currently engaged) has turned out to perform poorly. The policy chosen to replace the incumbent policy for the next scheduling interval is then activated.

(2) This scheduling quantum should not be confused with that of the job scheduler. A typical quantum for job scheduling is in the range of milliseconds, which can be equivalent to a million cycles.
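To make the quantum check concrete, here is a minimal C sketch of the low-throughput test, under assumptions of ours: the read_special() accessor and the counter ID are hypothetical stand-ins for the special-register interface, and the integer scaling of the threshold is our choice for a detector thread that avoids floating point.

```c
#include <stdint.h>

#define QUANTUM_CYCLES 8192           /* 8K-cycle scheduling quantum        */
#define MAX_ISSUE      8              /* 8-wide machine: 8K x 8 upper bound */

/* Hypothetical special-register read; stands in for the hardware
 * counter interface described in the text. */
extern uint64_t read_special(int counter_id);
#define CTR_COMMITTED_TOTAL 0

/* Returns 1 if the quantum that just ended was low-throughput. The IPC
 * comparison is done in integer arithmetic:
 *   committed / QUANTUM_CYCLES < thold  <=>  committed * 100 < thold_x100 * QUANTUM_CYCLES */
static int quantum_was_low_throughput(uint64_t *prev_committed,
                                      uint32_t ipc_thold_x100)
{
    uint64_t total = read_special(CTR_COMMITTED_TOTAL);
    uint64_t committed = total - *prev_committed;   /* this quantum only */
    *prev_committed = total;
    return committed * 100 < (uint64_t)ipc_thold_x100 * QUANTUM_CYCLES;
}
```

A threshold of, say, IPC_thold = 2 would be passed as ipc_thold_x100 = 200 in this encoding.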

In the meantime, during the remaining idle slots, other functions can be accomplished. The first is to identify the clogging threads: by looking at the per-thread status counters, the threads that are clogging the pipeline for various reasons can be identified and marked, so that the job scheduler, once loaded, can suspend them without going through the possibly long process of identifying them for itself. This results in a shorter period of activity for the job scheduler. The second is to enforce the incumbent policy: the per-thread status counters are checked and the priority array is updated depending on the values of the counters. The thread selection unit then looks at the array to decide which two threads should be selected for instruction fetch at each cycle.

[Figure 2. Software architecture of the detector thread: the status counters are updated; when the quantum IPC falls below the threshold, a new policy is determined and switched in and the clogging threads are identified; Policy Enforce then drives the TSU.]

4.1 Pseudo Code of the Detector Thread

The pseudo code of the detector thread is shown in Figure 3. The main subroutine Detector_Thread() has a large endless while loop with a jump location, East, right ahead of it. If the condition IPC_last < IPC_thold holds true, it is recognized as a low-throughput event and the required actions are taken. IPC_last is the number of committed instructions per cycle during the last eight-kilocycle quantum and IPC_thold is the IPC threshold value, which is predetermined by the developer of the detector thread management kernel. This threshold value may also be updated by the detector thread software. Once a low-throughput condition is recognized, Identify_CloggingThreads() is called and the cause of the low throughput is analyzed to identify the clogging threads. Determine_NewPolicy() is called next to find the policy that should be engaged in the next quantum. This stage needs the most effort, since choosing a new policy will significantly affect the throughput of the next scheduling interval. The new policy is then engaged as the next incumbent policy by the function Policy_Switch() and a jump to the subroutine Policy_Enforce() is made. In this routine, the thread priority array (TPA) is updated depending on the current system state and the incumbent policy, while the thread selection unit (TSU) examines this array to determine the threads for instruction fetch at each cycle. The TSU selects up to two threads at each cycle because we are using ICOUNT.2.8 [20].

[Figure 3. The framework of a detector thread in pseudo code (abridged).]

4.2 Determination of Threshold Values

The big question to address before determining the next fetch policy is how we know whether the processor is experiencing low throughput. What is the threshold reference against which we can make accurate judgments? Figure 7 illustrates how the value of the IPC threshold affects the frequency of switchings and the quality of a switch (the quality of a switch is high when the switch results in an increase of throughput in the next scheduling interval). If the threshold value is too low, very little switching takes place while the quality of a switch can often be high: when low throughput is detected against a very low threshold, the incumbent policy is unlikely to be capable of improving the situation. If the value is too high, switching occurs too frequently.
Further, the quality of each switch can be very low, since it is more likely that the situation cannot improve even with alternative policies, the current throughput already being fairly high.

4.3 Determination of Next Fetch Policy

Underlying Premises

Once it turns out that the incumbent fetch policy fails to sustain high throughput, the following may be taken into consideration to determine a new fetch policy:

- What was the fetch policy for the last quantum?
- What are the current conditions? (Instruction counts, cache miss rates, etc.)
- Is the IPC increasing or decreasing? (Throughput gradient)

- What has been the history of a fetch policy's effect under a certain condition?

The more factors we take into consideration, the more sophisticated and informed the determination heuristic becomes, but a too-sophisticated heuristic may not fit in the available cycle budget or in the DT PRAM, whose size is also limited. The fewer factors we take into consideration, the lower the overhead of the detector thread and the quicker its response, but limiting the sophistication of the scheduling algorithm may result in weak performance. Thus, we need to find the trade-off where the overhead fits our budget while still producing good results. The simplest way to determine the new policy is a fixed transition with no consideration of current conditions. This is basically what we do in our Type 1 heuristic (Figure 4). It should be noted, however, that switching to another specific policy may worsen an already deteriorating situation instead of improving it, if the newly engaged policy does not happen to address the problems the system is currently experiencing. This kind of approach also relies heavily on the value of the threshold: a higher threshold value is more likely to cause such adverse effects, while a lower value is less so.

Various Heuristics

The first of the heuristics, called Type 1, is the simplest way of determining a new fetch policy. In this scheme, no status indicators are referenced before making a decision, so it is not sensitive to the state the system is currently in. As long as a low-throughput condition is not detected, the current state, that is, the incumbent fetch policy, is maintained. Once a low-throughput condition has been detected, a transition to the other policy (either BRCOUNT or ICOUNT) is unconditionally made. Initially, the default fetch policy is ICOUNT. The advantage of this scheme is that the software overhead of the detector thread is minimal, to the degree that it could be implemented in hardware; however, the detector thread's advantages of flexibility and programmability would then not be available.

[Figure 4. Type 1 heuristic for determination of a new fetch policy: a two-state machine alternating between ICOUNT and BRCOUNT.]

The Type 2 heuristic is another simple way of determining a new fetch policy. As in Type 1, no status indicators are referenced for the decision. The difference (Figure 5) is that one more state (or fetch policy) has been added to the original finite state machine. Variants of this scheme can be made by changing the sequence of the transitions, currently set to the order of ICOUNT, L1MISSCOUNT and BRCOUNT, or by adding more fetch policies to the current set of three.

[Figure 5. Type 2 heuristic for determination of a new fetch policy: a three-state cycle over ICOUNT, L1MISSCOUNT and BRCOUNT.]

Type 1 and Type 2 consider only what the fetch policy was for the last quantum once low throughput is detected. There is only one state that can be transited to from each state; thus, as long as low throughput is not avoided, the two or three states are entered in a cyclic fashion. In the Type 3 heuristic (Figure 6), one of two states can be entered from each state, depending on the value of some specific conditions: the transition is made to the policy that is reckoned to improve throughput under the current condition. The Type 3 heuristic relies on the following conditions.

COND_MEM is true when one of the following two subconditions is true:
1. The L1 miss count for the last quantum is higher than its threshold value of 0.19 misses/cycle.
2. The load/store queue becomes full too often, more often than its threshold value of 0.45 times/cycle.

COND_BR is true when one of the following two subconditions is true:

1. The branch misprediction count for the last quantum is higher than its threshold value of 0.02 mispredictions/cycle.
2. The count of conditional branches for the last quantum is higher than its threshold value of 0.38 branches/cycle.

The specific threshold values above for the L1 miss count, load/store queue occupancy rate, branch count and branch misprediction count were determined by simulation: we ran eight-thread simulations in our SMT simulator with our 13 different mixes of applications and took the average value of each metric. These measures are certainly dependent on the hardware configuration and on the kind of mixes running in the processor; there can be no single golden reference that can always be used. To be more effective, the threshold values should be updated to reflect newly found information. That is one of the reasons why the detector thread approach suits adaptive dynamic thread scheduling: the system's detector thread management kernel can profile the system, determine whether the current threshold values are obsolete and, if so, update them to reflect the new state of the system. This update can be done by writing the values into the detector thread's DT DRAM through DMA [15].
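As an illustration of how cheaply these conditions can be evaluated, the following C sketch encodes COND_MEM and COND_BR with the thresholds quoted above. It is a minimal sketch under our own assumptions: the counter fields mirror the hypothetical thread_status_t from Section 3, the per-quantum rates are scaled to integers to avoid floating point, and the function names are illustrative, not from the paper.

```c
#include <stdint.h>

#define QUANTUM_CYCLES 8192

/* Per-quantum event totals, summed over all threads (hypothetical). */
typedef struct {
    uint32_t l1_misses;     /* L1 cache misses in the last quantum        */
    uint32_t lsq_full;      /* cycles the load/store queue was full       */
    uint32_t br_mispred;    /* branch mispredictions in the last quantum  */
    uint32_t cond_branches; /* conditional branches in the last quantum   */
} quantum_stats_t;

/* Does count/QUANTUM_CYCLES exceed thresh_x100/100?
 * Kept in integer arithmetic for the detector thread. */
static int rate_exceeds(uint32_t count, uint32_t thresh_x100)
{
    return (uint64_t)count * 100 > (uint64_t)thresh_x100 * QUANTUM_CYCLES;
}

/* COND_MEM: L1 misses > 0.19/cycle, or LSQ full > 0.45/cycle. */
static int cond_mem(const quantum_stats_t *s)
{
    return rate_exceeds(s->l1_misses, 19) || rate_exceeds(s->lsq_full, 45);
}

/* COND_BR: mispredictions > 0.02/cycle, or cond. branches > 0.38/cycle. */
static int cond_br(const quantum_stats_t *s)
{
    return rate_exceeds(s->br_mispred, 2) || rate_exceeds(s->cond_branches, 38);
}
```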

[Figure 6. Type 3 heuristic for determination of a new fetch policy: a state machine over ICOUNT, BRCOUNT and L1MISSCOUNT whose transitions are guarded by COND_MEM, COND_BR and their negations.]

The Type 3 heuristic works as follows. Suppose that BRCOUNT is the incumbent fetch policy when low throughput is detected. This implies that BRCOUNT has not worked well during the last quantum and that there is no crucial imbalance in conditional branches among the threads of the current set; the imbalance might lie in other factors. We can then guess that one of the other policies, ICOUNT or L1MISSCOUNT, may work better. We therefore check the value of the condition COND_MEM. If it holds true, the imbalance might have been in the number of L1 cache misses or in the usage of the load/store queue, so the transition is made to L1MISSCOUNT. Otherwise, the problem probably does not lie in memory usage, and the transition is made to ICOUNT, which works best on the average.

For another type of heuristic, Type 4, we add two features (a sketch follows Table 1 below). The first is to take the gradient of the throughput into account: even when low throughput is detected, if the throughput is higher than that observed one quantum earlier (positive gradient), switching policies is not allowed. That way, we wait for the situation to keep improving under the original fetch policy. The second feature is to keep track of the switching history. In the switching history buffer, the following are recorded for each policy switching event:

- Incumbent policy: the fetch policy engaged before the switch takes place.
- Value of the condition: for each policy, one condition is checked; its value is recorded.
- Counter for positive outcomes (poscnt): incremented every time this specific case ended with an increase in throughput.
- Counter for negative outcomes (negcnt): incremented every time this specific case ended with a decrease in throughput.

Before making the final decision, poscnt and negcnt are compared. If poscnt is greater, the regular switch is made; otherwise, the opposite direction is chosen. For instance, suppose the incumbent policy was ICOUNT and low throughput is detected. With COND_BR being true, the transition under the Type 3 heuristic would have been toward BRCOUNT. In Type 4, the counters (poscnt and negcnt) are examined and, if poscnt is not greater than negcnt, the transition is made toward the opposite policy, L1MISSCOUNT.

Table 1. Various Fetch Policies tested

  BRCOUNT       Number of total branches for a thread
  LDCOUNT       Number of total loads for a thread
  MEMCOUNT      Number of total memory accesses for a thread
  L1MISSCOUNT   Number of total L1 cache misses for a thread
  L1IMISSCOUNT  Number of total L1 ICache misses for a thread
  L1DMISSCOUNT  Number of total L1 DCache misses for a thread
  ICOUNT        Current instruction queue population for a thread
  ACCIPC        Accumulated IPC for a thread
  STALLCOUNT    Number of total stalls incurred for a thread
  RR            Round-robin scheduling
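To make the Type 4 bookkeeping concrete, here is a minimal C sketch of the switching history buffer and the poscnt/negcnt override. It rests on assumptions of ours: the policy and condition encodings, the table indexed by (incumbent policy, condition value), and the transitions for an incumbent L1MISSCOUNT (which the paper does not spell out) are hypothetical; only the BRCOUNT and ICOUNT cases follow the worked examples in the text.

```c
#include <stdint.h>

typedef enum { P_ICOUNT, P_BRCOUNT, P_L1MISSCOUNT, N_POLICIES } policy_t;

/* One switching-history entry per (incumbent policy, condition value). */
typedef struct {
    uint32_t poscnt;   /* switches from this case that raised throughput  */
    uint32_t negcnt;   /* switches from this case that lowered throughput */
} history_t;

static history_t hist[N_POLICIES][2];

/* Type 3 transition table: next policy given the incumbent and the value
 * of that policy's associated condition (illustrative encoding; the
 * L1MISSCOUNT row is our guess). */
static const policy_t type3_next[N_POLICIES][2] = {
    /* incumbent ICOUNT,      checks COND_BR  */ { P_L1MISSCOUNT, P_BRCOUNT },
    /* incumbent BRCOUNT,     checks COND_MEM */ { P_ICOUNT, P_L1MISSCOUNT },
    /* incumbent L1MISSCOUNT, checks COND_BR  */ { P_ICOUNT, P_BRCOUNT },
};

/* Type 4: follow the Type 3 transition only if history says it has paid
 * off more often than not; otherwise take the opposite branch. */
static policy_t type4_next(policy_t incumbent, int cond)
{
    const history_t *h = &hist[incumbent][cond];
    int take_regular = h->poscnt > h->negcnt;
    return type3_next[incumbent][take_regular ? cond : !cond];
}

/* Called one quantum after a switch, once its outcome is known. */
static void record_outcome(policy_t incumbent, int cond, int ipc_went_up)
{
    if (ipc_went_up) hist[incumbent][cond].poscnt++;
    else             hist[incumbent][cond].negcnt++;
}
```

With incumbent ICOUNT and COND_BR true, type4_next() returns BRCOUNT when poscnt > negcnt and L1MISSCOUNT otherwise, matching the example in the text.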
5. Methodology

We used the SimpleSMT simulator [9], an extension of the SimpleScalar tool set [1]; it thus inherits most architectural specifications of the superscalar model in SimpleScalar. The main architectural difference is that SimpleSMT has separate integer and floating-point instruction queues and more pipeline stages, to reflect the additional complexity of SMT. The simulation environment has been configured with resources compatible with previous research on SMT [20] (for verification purposes), as in our previous work [15]. We used SPEC CPU2000 [7] as our simulation workload and formed thirteen program mixtures based on each program's properties: IPC on a single-threaded machine model, memory footprint, and whether the application requires floating-point operations. For combinations with a mix of integer and floating-point applications, we attempted to make the mix as even as possible. For simulation of the 4- and 6-thread cases, some applications were randomly chosen to be excluded from the 8-thread mixes.

We modeled ten different fetch policies, as shown in Table 1. BRCOUNT, L1DMISSCOUNT, ICOUNT and RR were proposed and evaluated in [20]. Additionally, we included LDCOUNT, MEMCOUNT, ACCIPC and STALLCOUNT in our list, and L1MISSCOUNT and L1IMISSCOUNT were added to take a closer look at the effect of the caches. The description of each policy is found in the table. At each cycle, the simulator sorts the threads according to the fetch policy. Instructions are fetched from the first thread as long as a cache block boundary is not met; if no boundary is encountered, all eight instructions are fetched from that one thread, and otherwise instructions can be fetched from the next thread. We limited the number of threads that can be fetched in one cycle to two, since a study [2] showed that fetching all eight instructions from one thread can adversely affect performance due to fetch fragmentation. For fair comparison, we applied the same fetch mechanism, sketched below, to both fixed scheduling and adaptive scheduling.
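The per-cycle fetch partitioning just described might look like the following C sketch. It is an illustration under our own assumptions: sort_by_policy() stands in for whichever fetch policy is active, and the cache-line geometry is a parameter rather than a value fixed by the paper.

```c
#include <stdint.h>

#define FETCH_WIDTH   8   /* at most 8 instructions fetched per cycle     */
#define MAX_FETCH_THR 2   /* at most 2 threads may fetch in one cycle     */
#define LINE_INSNS    8   /* instructions per cache block (assumed)       */

/* Hypothetical hook: orders thread IDs by the active fetch policy
 * (ICOUNT, BRCOUNT, ...), highest priority first; returns the count. */
extern int sort_by_policy(int *order, int nthreads);

/* Hypothetical per-thread fetch PCs, in instruction-word units. */
extern uint64_t pc_of[];

/* Fetch up to FETCH_WIDTH instructions from at most MAX_FETCH_THR
 * threads; each thread fetches only up to its cache block boundary. */
static int fetch_one_cycle(int nthreads)
{
    int order[64];
    int n = sort_by_policy(order, nthreads);
    int fetched = 0, threads_used = 0;

    for (int i = 0; i < n && threads_used < MAX_FETCH_THR
                         && fetched < FETCH_WIDTH; i++) {
        int t = order[i];
        /* Instructions left in this thread's current cache block. */
        int to_boundary = LINE_INSNS - (int)(pc_of[t] % LINE_INSNS);
        int take = to_boundary;
        if (take > FETCH_WIDTH - fetched)
            take = FETCH_WIDTH - fetched;
        pc_of[t] += (uint64_t)take;   /* consume the fetched slots */
        fetched  += take;
        threads_used++;
    }
    return fetched;
}
```

If the highest-priority thread sits at the start of a cache block, it supplies all eight instructions and no second thread fetches that cycle; otherwise the remainder of the budget goes to the next thread in priority order.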

Because of the huge size of the SPEC 2000 applications, it is almost impossible to run simulations to the end of all programs. Since the reference mode of a typical SPEC 2000 application averages 200 billion instructions, and the performance of our simulator is about 25K instructions per second, it would take about three months to run a single application to completion. To lower the time requirement and still obtain accurate simulation results, we ran simulations of a million cycles each in ten randomly chosen intervals, taking advantage of the fast-forward feature of the SimpleScalar simulator [1].

6. Experimental Results

Figures 7 a) and c) verify what we had surmised in Section 4.2. As the threshold value increases, more switchings occur for all types of heuristics. The quality of a switch decreases as the threshold value increases, but not as fast as the number of switchings increases. Note that with Type 1 and Type 2 this is not the case; the quality of a switch may be higher with a threshold value of 3 than with 2. Figures 7 b) and d) show how the policy determination heuristic type affects the frequency and quality of switchings. Type 3' represents the Type 3 heuristic plus consideration of the throughput gradient. It is interesting to note that the Type 4 heuristic results in more low-quality (malignant) switchings. This implies that determining a new fetch policy based on historical performance is not effective.

[Figure 7. Effect of the threshold value on switch occurrence and quality. Panels: (a) number of switchings vs. threshold value, (b) number of switchings vs. heuristic type, (c) probability of benign switches vs. threshold value, (d) probability of benign switches vs. heuristic type, with curves for Type 1, Type 2, Type 3, Type 3' and Type 4.]
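For reference, the switch-quality statistic plotted in Figure 7 c) and d) can be tallied with a few counters. The sketch below is our own rendering of the metric as defined in Section 4.2 (a switch is benign when throughput rises in the next scheduling interval); the names are illustrative.

```c
/* Tally of policy-switch outcomes over a run. */
typedef struct {
    unsigned switches;  /* total policy switches                      */
    unsigned benign;    /* switches followed by higher quantum IPC    */
} switch_quality_t;

/* Record one switch once the following quantum's IPC is known. */
static void account_switch(switch_quality_t *q,
                           double ipc_before, double ipc_after)
{
    q->switches++;
    if (ipc_after > ipc_before)
        q->benign++;
}

/* Probability of a benign switch, as plotted in Figure 7 c) and d). */
static double benign_probability(const switch_quality_t *q)
{
    return q->switches ? (double)q->benign / q->switches : 0.0;
}
```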
Figure 8 shows the effect of the IPC threshold value on throughput. The best performance is reached when the threshold value is 2 and the Type 3 heuristic is used; the maximum performance improvement over ICOUNT is about 30%. The values in the graphs are averaged over all the various mixtures. We also found that greater improvements can be achieved when more similar applications appear in a mixture; with a mixture of various applications, less improvement was achieved.

[Figure 8. Effect of the threshold value and policy determination heuristic on throughput (average of all mixtures). Panels: (a) IPC vs. threshold value, (b) IPC vs. heuristic type, (c) aggregate IPC vs. threshold value for each type, (d) aggregate IPC vs. type for each threshold value (m=1 through m=5).]

It should also be noted that the throughput values observed in this experiment are relatively low considering the number of threads involved. The reason lies in the configuration we chose for simulation. SimpleSMT, like SimpleScalar, does not simulate system calls; instead, it translates them into host system calls for efficient simulation, sacrificing accuracy. We assumed that when a thread encounters a system call, all threads have to be flushed from the pipeline before the system call can start, which is the most conservative assumption. In real situations, if the system call performs critical operations such as memory allocation, that will indeed be the case, because such operations may affect all threads resident in the processor.

7. Summary and Conclusion

This paper has investigated how much improvement can be obtained by allowing an adaptive dynamic thread scheduling approach rather than the fixed scheduling approaches employed in earlier work. It proposed the detector thread approach to implement adaptive scheduling with low hardware and software overhead. The detector thread is a special thread that occupies one designated thread context with minimal extra hardware; it is scheduled for execution when idle slots are available. To validate the idea, we used the SimpleSMT simulator to derive an upper bound on the performance improvement we can hope to achieve using our approach. SPEC 2000 applications were used to create thirteen different mixes of applications based on single-application performance, memory footprint and type (integer or floating-point). Simulation results showed that there still is significant room (27%) for performance improvement over fixed scheduling for eight threads, room on which adaptive scheduling can capitalize. This paper stresses that adaptive scheduling is feasible because our platform is SMT, where it is possible to keep one thread resident in the processor with minimal overhead.

The results we obtained in this study are greatly encouraging. Since SMT was introduced, studies have shown that having too many threads (usually more than four or five) does not return the expected throughput increase and sometimes even lowers throughput. Our study has shown that adaptive thread scheduling in combination with a detector thread can significantly extend the saturation point in terms of the number of threads, provided that the detector thread is programmed with effective low-throughput detection and fetch policy selection algorithms. The software architecture for the detector thread was developed and various heuristics were evaluated for determining the fetch policy to be used for the next scheduling quantum. Type 3 turned out to work best, with a threshold value of 2. Type 4, which keeps track of the outcomes of earlier decisions, turned out not to be worth the effort: the fetch policies showed no correlation in the time domain, because there is no fixed pattern in the interactions between independent threads. Once the job scheduler is put into the picture, the set of applications will change even more dynamically, and correlation in the time domain will be even harder to find. We also found that with a mixture of various applications, less improvement was achieved with ADTS over the fixed ICOUNT scheduling. That is because a good mixture of applications already maintains high utilization of the various resources available in the processor. Consequently, we may ask the following question: why not just let the job scheduler concentrate on co-scheduling well-balanced sets of applications? Then ICOUNT would work well, and not much improvement could be made over it with adaptive scheduling. Our answer to the question is no, for two reasons. The first is that the job scheduler cannot co-schedule well-balanced sets of applications all the time, especially when the number of jobs available in the system is not significantly larger than the number of hardware contexts of the SMT processor. The second is that, without the detector thread, the job scheduler would have to stay on the processor for a significantly longer duration.

References

[1] T. Austin. The SimpleScalar Architectural Research Tool Set, Version 2.0. Technical Report 1342, University of Wisconsin-Madison, June.
[2] J. Burns and J.-L. Gaudiot. Exploring the SMT Fetch Bottleneck. In Proceedings of the Workshop on Multithreaded Execution, Architecture and Compilation (MTEAC99), Orlando, Florida, January.
[3] R. Chappell, J. Stark, S. Kim, S. Reinhardt, and Y. Patt. Simultaneous Subordinate Microthreading (SSMT). In Proceedings of the 26th Annual International Symposium on Computer Architecture, May.
[4] S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, and D. Tullsen. Simultaneous Multithreading: A Platform for Next-Generation Processors. IEEE Micro, pages 12-18, September/October.
[5] M. Gulati and N. Bagherzadeh. Performance Study of a Multithreaded Superscalar Microprocessor. In Proceedings of the 2nd International Symposium on High Performance Computer Architecture, February.
[6] L. Gwennap. DanSoft Develops VLIW Design. Microprocessor Report, 11(2):18-22, February.
[7] J. Henning. SPEC CPU2000: Measuring CPU Performance in the New Millennium. IEEE Computer, 33(7):28-35, July.
[8] H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa. An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads. In Proceedings of the 19th Annual International Symposium on Computer Architecture, May.
[9] S. Lee and J.-L. Gaudiot. ALPSS: Architectural Level Power Simulator for Simultaneous Multithreading, Version 1.0. Technical Report TR-02-04, University of Southern California, April.
[10] J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen. Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading. ACM Transactions on Computer Systems, August.
[11] C. Luk. Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 40-51, June.
[12] M. McCormick, J. Ledlie, and O. Zaki. Adaptively Scheduling Processes on a Simultaneous Multithreading Processor. Technical report, University of Wisconsin-Madison.
[13] S. Parekh, S. Eggers, H. Levy, and J. Lo. Thread-Sensitive Scheduling for SMT Processors. Technical report, University of Washington.
[14] A. Roth and G. Sohi. Speculative Data-Driven Multithreading. In Proceedings of the 7th International Symposium on High Performance Computer Architecture, pages 37-48, Monterrey, Mexico, January.
[15] C. Shin, S. Lee, and J.-L. Gaudiot. The Need for Adaptive Dynamic Thread Scheduling in Simultaneous Multithreading. In Proceedings of the 1st Workshop on Hardware/Software Support for Parallel and Distributed Scientific and Engineering Computing (SPDSEC-02), held in conjunction with the 11th International Conference on Parallel Architectures and Compilation Techniques (PACT-02), September.
[16] U. Sigmund and T. Ungerer. Evaluating a Multithreaded Superscalar Microprocessor versus a Multiprocessor Chip. In Proceedings of the 4th PASA Workshop on Parallel Systems and Algorithms, April.
[17] A. Snavely and D. Tullsen. Symbiotic Jobscheduling for a Simultaneous Multithreading Architecture. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, Massachusetts, November.
[18] Y. Song and M. Dubois. Assisted Execution. Technical Report CENG 98-25, University of Southern California.
[19] G. E. Suh, S. Devadas, and L. Rudolph. A New Memory Monitoring Scheme for Memory-Aware Scheduling. In Proceedings of the High Performance Computer Architecture (HPCA'02) Conference, February.
[20] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May.
[21] D. Tullsen, S. Eggers, and H. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, June.
[22] H. Wang, P. Wang, R. Weldon, et al. Speculative Precomputation: Exploring the Use of Multithreading for Latency Tolerance. Intel Technology Journal, 6(1), February 2002.


Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern: Pipelining HW Q. Can a MIPS SW instruction executing in a simple 5-stage pipelined implementation have a data dependency hazard of any type resulting in a nop bubble? If so, show an example; if not, prove

More information

<Insert Picture Here> An Experimental Model to Analyze OpenMP Applications for System Utilization

<Insert Picture Here> An Experimental Model to Analyze OpenMP Applications for System Utilization An Experimental Model to Analyze OpenMP Applications for System Utilization Mark Woodyard Principal Software Engineer 1 The following is an overview of a research project. It is intended

More information

The Truth Behind IBM AIX LPAR Performance

The Truth Behind IBM AIX LPAR Performance The Truth Behind IBM AIX LPAR Performance Yann Guernion, VP Technology EMEA HEADQUARTERS AMERICAS HEADQUARTERS Tour Franklin 92042 Paris La Défense Cedex France +33 [0] 1 47 73 12 12 info@orsyp.com www.orsyp.com

More information

A Predictive Model for Cache-Based Side Channels in Multicore and Multithreaded Microprocessors

A Predictive Model for Cache-Based Side Channels in Multicore and Multithreaded Microprocessors A Predictive Model for Cache-Based Side Channels in Multicore and Multithreaded Microprocessors Leonid Domnitser, Nael Abu-Ghazaleh and Dmitry Ponomarev Department of Computer Science SUNY-Binghamton {lenny,

More information

Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking

Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking Kathlene Hurt and Eugene John Department of Electrical and Computer Engineering University of Texas at San Antonio

More information

Operatin g Systems: Internals and Design Principle s. Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings

Operatin g Systems: Internals and Design Principle s. Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings Operatin g Systems: Internals and Design Principle s Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Bear in mind,

More information

2. is the number of processes that are completed per time unit. A) CPU utilization B) Response time C) Turnaround time D) Throughput

2. is the number of processes that are completed per time unit. A) CPU utilization B) Response time C) Turnaround time D) Throughput Import Settings: Base Settings: Brownstone Default Highest Answer Letter: D Multiple Keywords in Same Paragraph: No Chapter: Chapter 5 Multiple Choice 1. Which of the following is true of cooperative scheduling?

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Department of Electrical Engineering Princeton University {carolewu, mrm}@princeton.edu Abstract

More information

MAGENTO HOSTING Progressive Server Performance Improvements

MAGENTO HOSTING Progressive Server Performance Improvements MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 sales@simplehelix.com 1.866.963.0424 www.simplehelix.com 2 Table of Contents

More information

Testing Database Performance with HelperCore on Multi-Core Processors

Testing Database Performance with HelperCore on Multi-Core Processors Project Report on Testing Database Performance with HelperCore on Multi-Core Processors Submitted by Mayuresh P. Kunjir M.E. (CSA) Mahesh R. Bale M.E. (CSA) Under Guidance of Dr. T. Matthew Jacob Problem

More information

The Importance of Software License Server Monitoring

The Importance of Software License Server Monitoring The Importance of Software License Server Monitoring NetworkComputer How Shorter Running Jobs Can Help In Optimizing Your Resource Utilization White Paper Introduction Semiconductor companies typically

More information

Validating Java for Safety-Critical Applications

Validating Java for Safety-Critical Applications Validating Java for Safety-Critical Applications Jean-Marie Dautelle * Raytheon Company, Marlborough, MA, 01752 With the real-time extensions, Java can now be used for safety critical systems. It is therefore

More information

Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification

Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification Introduction Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification Advanced Topics in Software Engineering 1 Concurrent Programs Characterized by

More information

Categories and Subject Descriptors C.1.1 [Processor Architecture]: Single Data Stream Architectures. General Terms Performance, Design.

Categories and Subject Descriptors C.1.1 [Processor Architecture]: Single Data Stream Architectures. General Terms Performance, Design. Enhancing Memory Level Parallelism via Recovery-Free Value Prediction Huiyang Zhou Thomas M. Conte Department of Electrical and Computer Engineering North Carolina State University 1-919-513-2014 {hzhou,

More information

CPU Scheduling. Basic Concepts. Basic Concepts (2) Basic Concepts Scheduling Criteria Scheduling Algorithms Batch systems Interactive systems

CPU Scheduling. Basic Concepts. Basic Concepts (2) Basic Concepts Scheduling Criteria Scheduling Algorithms Batch systems Interactive systems Basic Concepts Scheduling Criteria Scheduling Algorithms Batch systems Interactive systems Based on original slides by Silberschatz, Galvin and Gagne 1 Basic Concepts CPU I/O Burst Cycle Process execution

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

An Oracle White Paper July 2012. Load Balancing in Oracle Tuxedo ATMI Applications

An Oracle White Paper July 2012. Load Balancing in Oracle Tuxedo ATMI Applications An Oracle White Paper July 2012 Load Balancing in Oracle Tuxedo ATMI Applications Introduction... 2 Tuxedo Routing... 2 How Requests Are Routed... 2 Goal of Load Balancing... 3 Where Load Balancing Takes

More information

Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield.

Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield. Technical Report UCAM-CL-TR-707 ISSN 1476-2986 Number 707 Computer Laboratory Complexity-effective superscalar embedded processors using instruction-level distributed processing Ian Caulfield December

More information

ICS 143 - Principles of Operating Systems

ICS 143 - Principles of Operating Systems ICS 143 - Principles of Operating Systems Lecture 5 - CPU Scheduling Prof. Nalini Venkatasubramanian nalini@ics.uci.edu Note that some slides are adapted from course text slides 2008 Silberschatz. Some

More information

Operating Systems 4 th Class

Operating Systems 4 th Class Operating Systems 4 th Class Lecture 1 Operating Systems Operating systems are essential part of any computer system. Therefore, a course in operating systems is an essential part of any computer science

More information

DACOTA: Post-silicon Validation of the Memory Subsystem in Multi-core Designs. Presenter: Bo Zhang Yulin Shi

DACOTA: Post-silicon Validation of the Memory Subsystem in Multi-core Designs. Presenter: Bo Zhang Yulin Shi DACOTA: Post-silicon Validation of the Memory Subsystem in Multi-core Designs Presenter: Bo Zhang Yulin Shi Outline Motivation & Goal Solution - DACOTA overview Technical Insights Experimental Evaluation

More information

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department

More information

Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier.

Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier. Parallel Computing 37 (2011) 26 41 Contents lists available at ScienceDirect Parallel Computing journal homepage: www.elsevier.com/locate/parco Architectural support for thread communications in multi-core

More information

Capacity Estimation for Linux Workloads

Capacity Estimation for Linux Workloads Capacity Estimation for Linux Workloads Session L985 David Boyes Sine Nomine Associates 1 Agenda General Capacity Planning Issues Virtual Machine History and Value Unique Capacity Issues in Virtual Machines

More information

BridgeWays Management Pack for VMware ESX

BridgeWays Management Pack for VMware ESX Bridgeways White Paper: Management Pack for VMware ESX BridgeWays Management Pack for VMware ESX Ensuring smooth virtual operations while maximizing your ROI. Published: July 2009 For the latest information,

More information

Two-Stage Forking for SIP-based VoIP Services

Two-Stage Forking for SIP-based VoIP Services Two-Stage Forking for SIP-based VoIP Services Tsan-Pin Wang National Taichung University An-Chi Chen Providence University Li-Hsing Yen National University of Kaohsiung Abstract SIP (Session Initiation

More information

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

Precise and Accurate Processor Simulation

Precise and Accurate Processor Simulation Precise and Accurate Processor Simulation Harold Cain, Kevin Lepak, Brandon Schwartz, and Mikko H. Lipasti University of Wisconsin Madison http://www.ece.wisc.edu/~pharm Performance Modeling Analytical

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Course on: Advanced Computer Architectures INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Prof. Cristina Silvano Politecnico di Milano cristina.silvano@polimi.it Prof. Silvano, Politecnico di Milano

More information

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1 Monitors Monitor: A tool used to observe the activities on a system. Usage: A system programmer may use a monitor to improve software performance. Find frequently used segments of the software. A systems

More information

Putting it all together: Intel Nehalem. http://www.realworldtech.com/page.cfm?articleid=rwt040208182719

Putting it all together: Intel Nehalem. http://www.realworldtech.com/page.cfm?articleid=rwt040208182719 Putting it all together: Intel Nehalem http://www.realworldtech.com/page.cfm?articleid=rwt040208182719 Intel Nehalem Review entire term by looking at most recent microprocessor from Intel Nehalem is code

More information

Quality of Service versus Fairness. Inelastic Applications. QoS Analogy: Surface Mail. How to Provide QoS?

Quality of Service versus Fairness. Inelastic Applications. QoS Analogy: Surface Mail. How to Provide QoS? 18-345: Introduction to Telecommunication Networks Lectures 20: Quality of Service Peter Steenkiste Spring 2015 www.cs.cmu.edu/~prs/nets-ece Overview What is QoS? Queuing discipline and scheduling Traffic

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

Eight Ways to Increase GPIB System Performance

Eight Ways to Increase GPIB System Performance Application Note 133 Eight Ways to Increase GPIB System Performance Amar Patel Introduction When building an automated measurement system, you can never have too much performance. Increasing performance

More information

OPERATING SYSTEM - VIRTUAL MEMORY

OPERATING SYSTEM - VIRTUAL MEMORY OPERATING SYSTEM - VIRTUAL MEMORY http://www.tutorialspoint.com/operating_system/os_virtual_memory.htm Copyright tutorialspoint.com A computer can address more memory than the amount physically installed

More information

Performance Impacts of Non-blocking Caches in Out-of-order Processors

Performance Impacts of Non-blocking Caches in Out-of-order Processors Performance Impacts of Non-blocking Caches in Out-of-order Processors Sheng Li; Ke Chen; Jay B. Brockman; Norman P. Jouppi HP Laboratories HPL-2011-65 Keyword(s): Non-blocking cache; MSHR; Out-of-order

More information

CPU Scheduling. CPU Scheduling

CPU Scheduling. CPU Scheduling CPU Scheduling Electrical and Computer Engineering Stephen Kim (dskim@iupui.edu) ECE/IUPUI RTOS & APPS 1 CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling

More information

2

2 1 2 3 4 5 For Description of these Features see http://download.intel.com/products/processor/corei7/prod_brief.pdf The following Features Greatly affect Performance Monitoring The New Performance Monitoring

More information

Intel DPDK Boosts Server Appliance Performance White Paper

Intel DPDK Boosts Server Appliance Performance White Paper Intel DPDK Boosts Server Appliance Performance Intel DPDK Boosts Server Appliance Performance Introduction As network speeds increase to 40G and above, both in the enterprise and data center, the bottlenecks

More information

Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

More information

Resource Allocation Schemes for Gang Scheduling

Resource Allocation Schemes for Gang Scheduling Resource Allocation Schemes for Gang Scheduling B. B. Zhou School of Computing and Mathematics Deakin University Geelong, VIC 327, Australia D. Walsh R. P. Brent Department of Computer Science Australian

More information

Scheduling Algorithms for Effective Thread Pairing on Hybrid Multiprocessors

Scheduling Algorithms for Effective Thread Pairing on Hybrid Multiprocessors Scheduling Algorithms for Effective Thread Pairing on Hybrid Multiprocessors Robert L. McGregor Christos D. Antonopoulos Department of Computer Science The College of William & Mary Williamsburg, VA 23187-8795

More information

FLIX: Fast Relief for Performance-Hungry Embedded Applications

FLIX: Fast Relief for Performance-Hungry Embedded Applications FLIX: Fast Relief for Performance-Hungry Embedded Applications Tensilica Inc. February 25 25 Tensilica, Inc. 25 Tensilica, Inc. ii Contents FLIX: Fast Relief for Performance-Hungry Embedded Applications...

More information

OpenFlow Based Load Balancing

OpenFlow Based Load Balancing OpenFlow Based Load Balancing Hardeep Uppal and Dane Brandon University of Washington CSE561: Networking Project Report Abstract: In today s high-traffic internet, it is often desirable to have multiple

More information

Technical Properties. Mobile Operating Systems. Overview Concepts of Mobile. Functions Processes. Lecture 11. Memory Management.

Technical Properties. Mobile Operating Systems. Overview Concepts of Mobile. Functions Processes. Lecture 11. Memory Management. Overview Concepts of Mobile Operating Systems Lecture 11 Concepts of Mobile Operating Systems Mobile Business I (WS 2007/08) Prof Dr Kai Rannenberg Chair of Mobile Business and Multilateral Security Johann

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides

More information

Multi-core and Linux* Kernel

Multi-core and Linux* Kernel Multi-core and Linux* Kernel Suresh Siddha Intel Open Source Technology Center Abstract Semiconductor technological advances in the recent years have led to the inclusion of multiple CPU execution cores

More information

Rackspace Cloud Databases and Container-based Virtualization

Rackspace Cloud Databases and Container-based Virtualization Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many

More information

HP Smart Array Controllers and basic RAID performance factors

HP Smart Array Controllers and basic RAID performance factors Technical white paper HP Smart Array Controllers and basic RAID performance factors Technology brief Table of contents Abstract 2 Benefits of drive arrays 2 Factors that affect performance 2 HP Smart Array

More information

Scheduling. Yücel Saygın. These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum

Scheduling. Yücel Saygın. These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum Scheduling Yücel Saygın These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum 1 Scheduling Introduction to Scheduling (1) Bursts of CPU usage alternate with periods

More information

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX Overview CISC Developments Over Twenty Years Classic CISC design: Digital VAX VAXÕs RISC successor: PRISM/Alpha IntelÕs ubiquitous 80x86 architecture Ð 8086 through the Pentium Pro (P6) RJS 2/3/97 Philosophy

More information

A Review of Customized Dynamic Load Balancing for a Network of Workstations

A Review of Customized Dynamic Load Balancing for a Network of Workstations A Review of Customized Dynamic Load Balancing for a Network of Workstations Taken from work done by: Mohammed Javeed Zaki, Wei Li, Srinivasan Parthasarathy Computer Science Department, University of Rochester

More information