OpenMP in Multicore Architectures


Venkatesan Packirisamy, Harish Barathvajasankar

Abstract

OpenMP is an API (application program interface) used to explicitly direct multi-threaded, shared-memory parallelism. With the advent of multi-core processors, there has been renewed interest in parallelizing programs. Multi-core processors support executing threads in parallel while keeping the cost of communication between them very low, which opens up new domains for extracting parallelism. The aim of our project is to study how OpenMP could be used in multi-core processors. We investigated this in two parts - extracting fine-grained parallelism and extracting speculative parallelism. The report concentrates mainly on the second part - speculative parallelism - though we offer some insights into fine-grained parallelism. Our results show that specifying hints through OpenMP directives is important for exposing parallelism. We also propose a hardware technique that could improve the performance of thread-level speculation.

1 Introduction

OpenMP is an API used to explicitly direct multi-threaded, shared-memory parallelism. It was introduced in 1997 to standardize programming extensions for shared-memory machines and has been widely used since then to parallelize scientific programs. In OpenMP the user specifies the regions of the code that are parallel, together with the necessary synchronization - locks, barriers, etc. - to ensure correct execution of the parallel region. At runtime, threads are forked for the parallel region and are typically executed on different processors sharing the same memory and address space.

Recently many chip manufacturers have announced multi-core processors [1][2]. In a multi-core processor, each die contains several processing elements, which typically share a common L2 cache. The die acts like a traditional multiprocessor system, except that the cores share the L2 cache. The advantage is that different threads can execute on these processing elements with very low communication cost, since they share the L2 cache. Another advantage of having multiple cores is that they can be used to extract thread-level parallelism within a single program and hence increase its performance. A lot of research has been done in this area; many techniques rely on hardware-based mechanisms [3][4] and some depend on the compiler to extract the threads [5][6].
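To ground the discussion, the following is a minimal OpenMP example in C showing a user-marked parallel region; the array size and computation are arbitrary illustrations, not taken from the benchmarks studied in this report.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N];
        double sum = 0.0;

        /* The user marks the loop as parallel; the runtime forks threads that
           share the same address space and divide the iterations among them. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = 2.0 * b[i];
            sum += a[i];   /* the reduction clause replaces an explicit lock */
        }

        printf("sum = %f\n", sum);
        return 0;
    }

Compiled with an OpenMP-capable compiler (for example, gcc -fopenmp), the loop iterations are distributed across the available cores.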

So we have a system that can be used as a traditional multiprocessor with minimal communication cost, and that can also be used to improve the performance of single-threaded programs. OpenMP has so far been used only in traditional multiprocessor environments, and multi-core processors are very similar to traditional multiprocessors, so it is natural to consider OpenMP for multi-core processors. Since a multi-core processor can be used either as a traditional multiprocessor or to implement thread-level speculation (TLS), OpenMP can likewise be used in both ways:

Fine-grained parallelism. Because of the minimal communication cost in multi-core processors, we could parallelize loops that suffered from high communication cost in traditional multiprocessors. We could parallelize inner loops that were previously not worth parallelizing because of their communication cost and small size, and OpenMP can be extended with the additional directives needed to support this fine-grained parallelism.

Speculative parallelism. Multi-core processors can be used to improve single-thread performance through thread-level speculation (TLS). OpenMP can serve as a way for the user to pass hints to the compiler and hardware, exposing the parallelism available in the speculative threads.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 discusses the potential for using OpenMP to extract fine-grained parallelism. Section 4 describes the simulation environment used in the study of speculative parallelism. Section 5 discusses how OpenMP could be used for speculative parallelization. Section 6 presents the parallelization framework we used to parallelize the loops. Section 7 presents results comparing our parallelized loops with the loops parallelized by the compiler [5]. Section 8 discusses possible hardware optimizations. Section 9 concludes, and section 10 discusses future work.

2 Related Work

Using OpenMP to extract multi-level parallelism is studied in [7], where OpenMP directives were proposed for that purpose. We would want a similar technique for multi-core environments, where, say, the outer loop is parallelized across processors and the inner loop is parallelized across the processing elements inside each processor. [8] tried to apply speculative techniques to improve OpenMP programs: there, threads do not always (depending on the OpenMP hint) wait at synchronization points, so violations can occur; they are detected and the offending thread is squashed. In our report we use OpenMP to improve the thread-level parallelism of integer programs that are not inherently parallel.

Thread-level speculation (TLS) has been widely studied as a technique for extracting thread-level parallelism. The threads are extracted either in hardware [9][10][11][15] or in software [12][13][16] and are executed in parallel. The threads are not always independent, so dependence violations are typically detected in hardware, and the offending thread is squashed and restarted. In [6] the compiler is used to schedule the instructions in the threads so as to reduce the impact of inter-thread data dependencies; the code generated using that technique is what we compare our code against. In [5] pipelined execution of threads is studied. The parallelization framework we use is very similar to thread pipelining, except that our method is more general and more easily applied to different loops; moreover, in our case all compiler optimizations [6] remain applicable. In [11][12] hardware-based synchronization mechanisms have been studied to improve the performance of synchronization operations; in this report we propose hardware-based update mechanisms that are very similar in principle to those mechanisms. [16] used manual parallelization to improve the performance of speculative multithreading. Our report does not modify the basic algorithm design in the code; it only exposes parallelism to the compiler and hardware by providing hints.

3 Fine-grained parallelism

As the first possibility for using OpenMP in multi-cores, we tried to extract parallelism in the inner loops of the SpecOMP benchmarks. However, we were not able to use the OpenMP compiler [13] to compile the SpecOMP benchmarks, so we could not proceed very far in this direction. Instead we studied the possible parallelism in the SpecOMP benchmarks by manually inspecting the code. In the benchmark equake, we found that OpenMP had been used to parallelize 14 of the major loops. There were some inner loops, and also some loops that had not been parallelized with OpenMP; when we looked at them more closely, we found that the loops that were not parallelized have small iteration counts, and the inner loops have very small loop bodies. We were not able to analyze the FORTRAN-based benchmarks because of our limited familiarity with FORTRAN. Due to the apparent lack of potential and the difficulty in understanding and compiling the SpecOMP code, we abandoned this part and focused on the other part - speculative parallelism.

The rest of the paper concentrates on the main focus - speculative parallelization of integer programs using OpenMP directives. First we describe the simulation methodology and the basic multi-core architecture used in this project.

4 Simulation Methodology

For our study we used a SimpleScalar-based chip multiprocessor (CMP) simulator. The basic configuration of the simulator is given in fig. 1.

Figure 1: Basic Configuration

4.1 Basic Architecture

The simulated CMP architecture has 4 processing elements, each with the basic configuration stated above. Each has a private L1 cache, and all share an L2 cache. There is also a signal table per processing element, used for synchronization to avoid squashing. Each processing element has a store buffer and a load-address buffer of 2K entries each. When a speculative thread executes on a processor, its store values are placed in the store buffer, and the address of each issued load is placed in the load-address buffer. For each store in a thread, the address buffers of the subsequent (more speculative) threads are checked for a dependence violation. If a violation is detected, the violating thread and all subsequent threads are squashed, and execution restarts from the first violating thread.

4.2 Tool used

One of the major parts of our project is analyzing the different benchmark programs to find scope for the user to specify hints. We used a SimpleScalar-based tool to study the loops and identify frequently occurring data dependencies between threads. Although other tools based on dependence profiling and SCCs exist, we implemented our own tool because those tools do not report which PC address actually causes the dependence. We need this information because we analyze the benchmarks manually and must find the statements that actually cause dependence violations. To use the tool, we simply mark the loop we want to study, and the tool prints the PC addresses of store-load dependence pairs.
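The cross-thread check described in section 4.1 can be summarized by the following simplified sketch. This is our own pseudo-C model of the behavior, not the simulator's code; the structure and field names are assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define BUF_ENTRIES 2048   /* 2K-entry store and load-address buffers */
    #define NUM_PE      4

    /* Per-thread speculative state kept by each processing element. */
    typedef struct {
        uint64_t store_addr[BUF_ENTRIES];  /* buffered speculative stores */
        uint64_t load_addr[BUF_ENTRIES];   /* addresses of issued loads   */
        int      n_stores, n_loads;
    } spec_thread_t;

    spec_thread_t pe[NUM_PE];   /* threads ordered by speculation order */

    /* Conceptually invoked on every store by thread `t`: threads t+1, t+2, ...
       are more speculative, so a matching load address there means a later
       thread read the location too early. */
    bool store_causes_violation(int t, uint64_t addr)
    {
        for (int s = t + 1; s < NUM_PE; s++) {
            for (int i = 0; i < pe[s].n_loads; i++) {
                if (pe[s].load_addr[i] == addr) {
                    /* Squash thread s and all later threads; restart from s. */
                    return true;
                }
            }
        }
        return false;
    }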

Figure 2: CMP Architecture

4.3 Benchmarks studied

We aimed to study 6 of the SPEC2000 integer benchmarks. Although we analyzed all 6 benchmarks (fig. 3), we present results for only 3 of them; for the remaining benchmarks we offer insights into the behavior of their loops and how they could be parallelized.

Figure 3: Benchmark Description

4.4 Simulation methodology

Since we had to run the simulations many times while tuning for good performance, we worked only with the test input sets. For all three benchmarks for which results are presented, the results are for the test input sets.

5 Speculative parallelization

In this section we use the directives given by the user to improve the hardware and compiler mechanisms for thread-level speculation. First we examine the possible kinds of directives (section 5.1). Then in section 6 we present a framework for exposing parallelism, describe how the different benchmarks we studied fit into it (section 6.6), and compare it with other related work (section 6.7).

5.1 OpenMP directives

Many hardware- and compiler-based techniques have been proposed to improve thread-level speculation. Compiler techniques aim to reduce the complexity of the hardware by moving some of that complexity into the compiler, and the compiler, with its global view, is well placed to extract thread-level speculation. Even the hardware-based techniques usually assume some form of compiler support for selecting the region or specifying synchronization. But the compiler lacks some run-time information. One way to solve this problem is to allow the user to specify hints that help both the compiler and the architecture. The possible directives from the user can be broadly classified into three categories:

Identifying regions to parallelize. One of the major decisions for the compiler or hardware is which loop or region to parallelize. This is a hard decision, usually made by considering the execution time of the loop and its dependence properties. A user (or an optimizer) is in a better position to make this decision: the user can simply identify the loop and communicate it through a directive. This is very similar to the OpenMP parallel directive, except that the loop is only speculatively parallel.

Changing the loop body. After the loop is identified, the hardware can execute its threads on different processors simultaneously, but this may cause frequent squashes due to dependence violations. The hardware could therefore selectively synchronize on some dependencies; this usually requires compiler support, and the compiler can also schedule the code to reduce the impact of data dependencies. Here too the user is in a better position to optimize the code: the user can indicate, for a specific load, whether a dependence violation is likely, or can rearrange the loop to reduce dependence violations. This is the part we address in our project.

Other specific information. Beyond the two kinds of information above, the hardware may still need some specific information to execute the threads efficiently. For example, the user could specify the typical iteration count of the loop, from which the compiler can make important decisions, or provide load-balancing information that helps the hardware distribute the work efficiently. We do not consider this part in our project.
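To make the first two categories concrete, the sketch below shows how such hints might be written as OpenMP-style pragmas. The directive names (speculative for, expect_dependence) are hypothetical illustrations invented here; they are neither standard OpenMP nor the exact syntax this report proposes.

    /* Hypothetical speculative-parallelization hints in OpenMP pragma style. */

    /* Stand-in for the loop's main work (hypothetical). */
    static long heavy_processing(int x) { return (long)x * x; }

    void example(int n, long *cost)
    {
        long total = 0;

        /* Category 1: the user identifies a speculatively parallel loop. */
        #pragma omp speculative for
        for (int i = 0; i < n; i++) {
            long local = heavy_processing(i);

            /* Category 2: the user flags an access that frequently causes
               cross-iteration violations, so the hardware synchronizes on it
               instead of speculating through it. */
            #pragma omp expect_dependence(total)
            total += local;
        }

        *cost = total;
    }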

6 Framework for parallelization

In this section we discuss the typical behavior of the benchmark loops, and from that behavior we derive a framework for manual parallelization. By framework we mean a specific code layout - how the code should be arranged so that the parallelism in the loop is exposed to the hardware or compiler.

6.1 Method of analysis

To analyze a benchmark program, we first obtain the profile information and find the loops with high coverage. We then examine the data-dependence profile information; if the number of dependencies is very large, the loop is not considered further. Next, the SimpleScalar-based tool is used to find the actual statements in the source code that cause the mis-speculations, and the source code is analyzed to see exactly why each dependence occurs.

From our analysis, we could classify the benchmark loops into two types:

Possible parallel loops. In most loops, the data needed to start the next iteration does not depend on the main processing done in the current iteration. For example, in the inner loop of the function refresh_potential() in mcf, the induction variable is the only data needed to start the next iteration, and it is independent of the processing in the current iteration.

Serial loops. In many other loops, the next iteration depends heavily on the processing in the current iteration. For example, the loop in the sort_basket() function of mcf has very good coverage, but it is very serial - it is a quicksort algorithm, and the next iteration is started based on the value computed in the current iteration.

6.2 Ideal parallel execution

We saw that in some cases the only thing needed to start the next thread is the induction variable. Ideally, then, we want the loop to execute as sketched below.
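The following sketch is our own illustration of such an ideal case, loosely modeled on the mcf-style list traversal mentioned above; the node type and field names are hypothetical. The value that starts the next iteration is available before the iteration's main work, so a speculative thread for the next iteration could be forked immediately.

    typedef struct node {
        struct node *child;      /* hypothetical: next element to visit     */
        long         potential;  /* hypothetical: value updated by the body */
    } node_t;

    /* Stand-in for the iteration's main work (hypothetical). */
    static long process(const node_t *n) { return n->potential + 1; }

    void walk(node_t *root)
    {
        for (node_t *n = root; n != NULL; n = n->child) {
            /* n->child, the only thing the next iteration needs, is known
               here - before the heavy processing below. */
            n->potential = process(n);   /* independent of other iterations */
        }
    }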

Now the next iteration can start immediately after the data it needs has been generated. But when we examined the loops, we found an interesting trend.

6.3 Usual model of execution

Apart from generating the values needed for the next iteration, the loops usually update some global variables: the results of the processing done in the current iteration are added to some global variable or array. Updating values at the end of the iteration is very common across the benchmarks, and unlike the induction variables these updates cannot be moved to the top of the iteration, because they need the results of the entire iteration. For example, consider the code at the end of an iteration of a loop in new_dbox_a() of twolf:

            } else {
                m = 0;
            }
            tmp_missing_rows[net] = -m;
        }
        delta_vert_cost += ((tmp_num_feeds[net] - num_feeds[net]) +
                            (tmp_missing_rows[net] - missing_rows[net])) * 2 * rowheight;

At the end of the iteration, the global variable delta_vert_cost is updated with values calculated in the current iteration. Let us see how current TLS mechanisms handle such situations:

Speculation. One option is to execute the update speculatively. The update usually occurs towards the end of the iteration, so if the non-speculative thread completes first, no dependence violation occurs. This does not always work: if the speculative thread runs faster, it suffers a squash.

Synchronization. We could synchronize on this dependency. That avoids the squash but can serialize the code. In this example synchronizing is acceptable, but such reductions also occur in the middle of an iteration, e.g. the inner loop in new_dbox() of twolf.

Recovery code. Recovery code could help, but there are limitations to applying it. Consider the loop in new_dbox() of twolf:

        }
        *costptr += ABS( newx - new_mean ) - ABS( oldx - old_mean );
    }

Here the reduction occurs in the inner loop, so recovering would require re-executing the entire inner loop.

6.4 Our Technique

Among these options, synchronization seems to be a good choice, but performance can suffer if the update occurs in the middle of the loop. To reduce this impact, we privatize the update, accumulating the results in a temporary variable; at the end of the loop the temporary is added to the global structure. In our technique we try to move the global update downward, and if it occurs in the middle of the loop we privatize the local update. In our basic technique the global update is then guarded by a wait instruction. We also identify the induction variable so that it does not cause unnecessary synchronization or speculation. Based on this analysis, we believe a manually parallelized loop should have the structure sketched below, and in our basic technique we synchronize before entering the update phase. Different hardware-based optimizations are possible; they are discussed in section 8.
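A sketch of this three-phase structure, written in C with the wait/signal shown as hypothetical intrinsics standing in for the per-processing-element signal table of section 4.1; the variable and function names are illustrative, not taken from the benchmarks.

    #define UPDATE_PHASE 0

    long delta_cost = 0;   /* global reduction target */

    /* Stand-in for the iteration's main work (hypothetical). */
    static long heavy_processing(int i) { return (long)i; }

    /* Hypothetical intrinsics: block until / announce that the previous
       thread has finished the named phase. */
    void tls_wait(int phase);
    void tls_signal(int phase);

    void framework_loop(int lo, int hi)
    {
        for (int i = lo; i < hi; i++) {
            /* Phase 1: induction-variable computation (the i++ above) -
               nothing else is needed to start the next speculative thread. */

            /* Phase 2: main processing; the reduction is privatized in
               `local`, so the middle of the iteration touches no shared state. */
            long local = heavy_processing(i);

            /* Phase 3: global update, guarded so updates occur in thread order. */
            tls_wait(UPDATE_PHASE);
            delta_cost += local;
            tls_signal(UPDATE_PHASE);
        }
    }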

From our initial analysis, the coverage of the loops that could fit such a framework structure is summarized below:

    Benchmark    Potentially parallel loops (% of execution-time coverage)
    mcf
    twolf
    vpr

Some of these loops were later found to be either too large to parallelize or to have high-frequency inter-thread data dependences.

6.5 OpenMP hints

The user has to specify hints that identify these phases in the loop. The user can rearrange the code, privatize the variables, and perform the other optimizations directly, or the compiler can do the same provided the basic hints identifying the phases are given.

6.6 Behavior of benchmark programs

In this section we look at how the different benchmark programs we analyzed fit into the framework discussed in the previous section, and we suggest techniques that could be studied further for each benchmark.

a) twolf. Two high-coverage loops fit the framework exactly; without user directives these two loops are very difficult to parallelize. Some other loops suffer from small size. Unrolling could help them, but unrolling can disrupt the loop structure and cause mis-speculations; the hardware technique discussed later could be used for such loops.

b) mcf. Mcf also has many loops that satisfy the code framework. Here too the loops suffer from small size; they could be unrolled, but that violates the framework.

c) vpr. Some loops satisfy the global-update property, but they are enclosed by an outer loop that is also parallel, and even a simple compiler technique can identify that parallelism, so we do not gain much from our technique.

d) gzip and bzip2. There are some parallel loops, but they all have very small code size; they could give good performance if unrolled. We did not find any major loops that satisfy our framework.

e) vortex. Vortex loops contain many function calls, usually with complicated nesting, which made analyzing the loops very hard. From the tool's output we find that many dependences are caused by a status variable that is updated almost everywhere. We believe, however, that the value read is always 0 unless an error occurred, so the hardware could be modified to squash the thread only when the value read has actually changed.
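The vortex suggestion above amounts to a value-based dependence check: squash only if the newly stored value differs from what the speculative thread read. A simplified sketch follows (our own illustration extending the check from section 4.1; the field names are assumptions).

    #include <stdbool.h>
    #include <stdint.h>

    #define BUF_ENTRIES 2048

    /* Load-address buffer entry extended with the value that was read. */
    typedef struct {
        uint64_t addr;
        int64_t  value_read;
    } load_entry_t;

    typedef struct {
        load_entry_t load[BUF_ENTRIES];
        int          n_loads;
    } spec_loads_t;

    /* A store by an earlier thread squashes a later thread only if it would
       have changed the value that thread saw (e.g. the vortex status variable
       being rewritten with the same value 0). */
    bool store_causes_violation(const spec_loads_t *later_thread,
                                uint64_t addr, int64_t new_value)
    {
        for (int i = 0; i < later_thread->n_loads; i++) {
            if (later_thread->load[i].addr == addr &&
                later_thread->load[i].value_read != new_value) {
                return true;    /* value actually changed: squash and restart */
            }
        }
        return false;           /* same value rewritten: speculation still valid */
    }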

6.7 Comparison with other techniques

In our technique there are three phases of execution - induction-variable calculation, main processing, and global update. This is very similar to the thread-pipelining strategy of [5]. In [5] the first two stages are the same, but the update stage is replaced by a write-back stage; there the aim is always to produce all the information needed by the next thread as soon as possible, whereas in our case we move the updates downward. In [16] manual parallelization was discussed, but the focus there was on changing the code completely to increase parallelism. In our technique the user only exposes the available parallelism; parallelism is never created.

7 Results

7.1 Results of basic technique

The aim of passing user hints is to overcome the limitations of the compiler and hardware, so we compare our technique with the compiler-based technique of [6], in which the code is scheduled to reduce the impact of data dependencies. We obtained the code generated by the scheduling technique from Prof. Antonia Zhai, who had implemented it in the SUIF compiler. That code, however, contained directives specific to their environment and carried a very large code overhead, so we replicated the effect of their code in the original benchmark code: signal and wait instructions were inserted at exactly the same points, the loops were unrolled exactly the same number of times, and all other modifications were made to match their code completely. We then applied our technique on top of that code. Sometimes the same loops could be scheduled better using our technique; at other times additional loops could be parallelized that the scheduling technique could not handle because of its limitations.

Fig. 4 compares the scheduling-based technique with our framework-based technique. The bars indicate the percentage decrease in execution time after applying each technique. Performance is sometimes worse than single-thread execution; one of the main reasons is instruction overhead. Fig. 5 shows the percentage increase in the number of instructions executed after applying the technique.

7.2 Analysis of results

Here we analyze the performance of each of the benchmarks.

1. mcf. Although there are parallel loops, they are very small. When we unrolled them, they suffered mis-speculations, so these two loops pulled the performance back. In the scheduler-based code, the second loop was synchronized; this led to serial execution, and at the same time there were gaps in the pipeline and instruction overhead. Due to these effects there was a large performance decrease.

Figure 4: Performance impact of the technique

Figure 5: Increase in the number of instructions executed

2. twolf. Our technique showed more than 13% performance improvement. The scheduler-based technique could not effectively parallelize two major loops, whereas our technique could cover many more loops.

3. vpr. The results are exactly the same. Although we were able to cover many loops not covered by the scheduler technique, there is one outer loop that encloses many of those inner loops, and the performance is dominated by that single loop.

8 Hardware based improvements

In this section we discuss some improvements that can be made over the basic technique.

8.1 Commit overlap

In a typical speculative multithreading processor, the commit of one thread starts only after the commit of the previous thread has finished. In the update phase of a thread's execution, the thread only updates the global variables, so it cannot cause any additional mis-speculations. The next thread can therefore start committing instructions once the previous thread has reached the update phase, allowing the commit times to be overlapped to some extent. In [14] commit time was identified as one of the major bottlenecks and different techniques for reducing it were studied.

8.2 Update in hardware

In [11][12] hardware-based synchronization was proposed to improve the performance of some common types of update operations. The update phase in our threads is also a kind of synchronization, so hardware support could be used to improve the performance of such operations. As we saw in the results, some of the benchmarks suffer from small code size; but if the code size is increased by unrolling the loops, the structure of the loop is affected and it no longer has the phase-based structure - there would now be two or more update sections distributed over the entire iteration. This can be a severe bottleneck, since in the basic technique each update section is guarded by synchronization. We propose a hardware buffer that holds the update instructions and their related load instructions. Instead of executing, the update instructions are sent to this buffer; when the previous thread sends the signal to start the update phase, the instructions in the buffer are executed. We thus maintain the order of the updates without suffering repeated synchronization.
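A behavioral sketch of the proposed update buffer, written as pseudo-C (our own model for illustration, not an RTL or simulator implementation; the structure, sizes, and names are assumptions, and overflow handling is omitted):

    #include <stdint.h>

    #define UPD_BUF_ENTRIES 64

    /* One deferred global update: target location and the value to add. */
    typedef struct {
        int64_t *target;    /* global variable being updated          */
        int64_t  operand;   /* value produced by the iteration's body */
    } update_entry_t;

    typedef struct {
        update_entry_t e[UPD_BUF_ENTRIES];
        int            count;
    } update_buffer_t;

    /* Issued in place of executing the update: the update is parked in the
       buffer, so unrolled iterations with several update sections do not
       each need to synchronize. */
    void defer_update(update_buffer_t *buf, int64_t *target, int64_t operand)
    {
        buf->e[buf->count].target  = target;
        buf->e[buf->count].operand = operand;
        buf->count++;
    }

    /* Invoked when the previous thread signals that its update phase is done:
       drain the buffer in program order so global updates stay ordered. */
    void drain_updates(update_buffer_t *buf)
    {
        for (int i = 0; i < buf->count; i++) {
            *buf->e[i].target += buf->e[i].operand;
        }
        buf->count = 0;
    }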

9 Conclusion

In this report we studied how OpenMP hints are applicable to multi-core architectures. OpenMP can be applied to extract fine-grained parallelism and also to extract speculative parallelism; we concentrated on the latter - how the user can expose parallelism so that the compiler and hardware are more effective. We identified a typical structure for loops that can be parallelized: the user writes the loop in that structure, and the compiler or the user can schedule the code accordingly. When we compared code generated in this way with code generated using a typical scheduling-based technique, we found that our technique showed much better performance. We also propose some hardware techniques that can further improve the performance.

10 Future Work

We need to analyze the remaining benchmarks; it is not clear whether the framework we present applies to programs like gcc, parser, etc. We also have to complete the hardware-based techniques and study their performance improvement. By analyzing more benchmarks we would gain a deeper understanding of their behavior and of any apparent patterns. In our analysis we found that load balancing could be an important factor in some benchmarks: some iterations execute faster than others because they take different control-flow paths (e.g., in vpr).

References

[1] J. Kahle, "Power4: A Dual-CPU Processor Chip," Microprocessor Forum 99, October 1999.

[2] M. Tremblay, "MAJC: Microprocessor Architecture for Java Computing," Hot Chips 99, August 1999.

[3] G. S. Sohi, S. Breach, and T. Vijaykumar, "Multiscalar Processors," in Proceedings of the 22nd International Symposium on Computer Architecture (ISCA), June 1995.

[4] P. Marcuello and A. González, "Clustered Speculative Multithreaded Processors," in Proceedings of the ACM International Conference on Supercomputing, June 1999.

[5] J.-Y. Tsai, J. Huang, C. Amlo, D. Lilja, and P.-C. Yew, "The Superthreaded Processor Architecture," IEEE Transactions on Computers, Special Issue on Multithreaded Architectures, 48(9), September 1999.

[6] A. Zhai, C. B. Colohan, J. G. Steffan, and T. C. Mowry, "Compiler Optimization of Scalar Value Communication Between Speculative Threads," in Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), San Jose, CA, USA, October 7-9.

[7] E. Ayguade, X. Martorell, J. Labarta, M. Gonzalez, and N. Navarro, "Exploiting Multiple Levels of Parallelism in OpenMP: A Case Study," in Proceedings of the 1999 International Conference on Parallel Processing.

[8] J. F. Martínez and J. Torrellas, "Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications," in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October

[9] V. Krishnan and J. Torrellas, "A Clustered Approach to Multithreaded Processors," in Proceedings of the International Parallel Processing Symposium (IPPS), March

[10] M. Prabhu and K. Olukotun, "Using Thread-Level Speculation to Simplify Manual Parallelization," in Proceedings of the 2003 Symposium on Principles and Practice of Parallel Programming (PPoPP), San Diego, CA, June 2003.

[11] "The Cedar System and an Initial Performance Study," in Proceedings of the 20th Annual International Symposium on Computer Architecture.

[12] H.-M. Su and P.-C. Yew, "On Data Synchronization for Multiprocessors," in Proceedings of the International Conference on Computer Architecture.

[13] Y. Chen, J. Li, S. Wang, and D. Wang, "ORC-OpenMP: An OpenMP Compiler Based on ORC," Tsinghua University, China.

[14] M. Prvulovic, M. J. Garzaran, L. Rauchwerger, and J. Torrellas, "Removing Architectural Bottlenecks to the Scalability of Speculative Parallelization," in Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA), June

[15] L. Hammond, B. Hubbert, M. Siu, M. Prabhu, M. Chen, and K. Olukotun, "The Stanford Hydra CMP," IEEE Micro, March-April.

[16] A. Zhai, C. B. Colohan, J. G. Steffan, and T. C. Mowry, "Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads," in Proceedings of the 2nd International Symposium on Code Generation and Optimization (CGO-2004), Palo Alto, CA, USA, March 20-24, 2004.


DATABASE CONCURRENCY CONTROL USING TRANSACTIONAL MEMORY : PERFORMANCE EVALUATION DATABASE CONCURRENCY CONTROL USING TRANSACTIONAL MEMORY : PERFORMANCE EVALUATION Jeong Seung Yu a, Woon Hak Kang b, Hwan Soo Han c and Sang Won Lee d School of Info. & Comm. Engr. Sungkyunkwan University

More information

An Adaptive Task-Core Ratio Load Balancing Strategy for Multi-core Processors

An Adaptive Task-Core Ratio Load Balancing Strategy for Multi-core Processors An Adaptive Task-Core Ratio Load Balancing Strategy for Multi-core Processors Ian K. T. Tan, Member, IACSIT, Chai Ian, and Poo Kuan Hoong Abstract With the proliferation of multi-core processors in servers,

More information

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

More information

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings This Unit: Multithreading (MT) CIS 501 Computer Architecture Unit 10: Hardware Multithreading Application OS Compiler Firmware CU I/O Memory Digital Circuits Gates & Transistors Why multithreading (MT)?

More information

Inside the Erlang VM

Inside the Erlang VM Rev A Inside the Erlang VM with focus on SMP Prepared by Kenneth Lundin, Ericsson AB Presentation held at Erlang User Conference, Stockholm, November 13, 2008 1 Introduction The history of support for

More information

Load-Balanced Pipeline Parallelism

Load-Balanced Pipeline Parallelism Load-Balanced Pipeline Parallelism Md Kamruzzaman Steven Swanson Dean M. Tullsen Computer Science and Engineering University of California, San Diego {mkamruzz,swanson,tullsen@cs.ucsd.edu ABSTRACT Accelerating

More information

Visualization Enables the Programmer to Reduce Cache Misses

Visualization Enables the Programmer to Reduce Cache Misses Visualization Enables the Programmer to Reduce Cache Misses Kristof Beyls and Erik H. D Hollander and Yijun Yu Electronics and Information Systems Ghent University Sint-Pietersnieuwstraat 41, Gent, Belgium

More information

A Systematic Approach to Model-Guided Empirical Search for Memory Hierarchy Optimization

A Systematic Approach to Model-Guided Empirical Search for Memory Hierarchy Optimization A Systematic Approach to Model-Guided Empirical Search for Memory Hierarchy Optimization Chun Chen, Jacqueline Chame, Mary Hall, and Kristina Lerman University of Southern California/Information Sciences

More information

picojava TM : A Hardware Implementation of the Java Virtual Machine

picojava TM : A Hardware Implementation of the Java Virtual Machine picojava TM : A Hardware Implementation of the Java Virtual Machine Marc Tremblay and Michael O Connor Sun Microelectronics Slide 1 The Java picojava Synergy Java s origins lie in improving the consumer

More information

Center for Programming Models for Scalable Parallel Computing

Center for Programming Models for Scalable Parallel Computing Overall Project Title: Coordinating PI: Subproject Title: PI: Reporting Period: Center for Programming Models for Scalable Parallel Computing Rusty Lusk, ANL Future Programming Models Guang R. Gao Final

More information

Scheduling Task Parallelism" on Multi-Socket Multicore Systems"

Scheduling Task Parallelism on Multi-Socket Multicore Systems Scheduling Task Parallelism" on Multi-Socket Multicore Systems" Stephen Olivier, UNC Chapel Hill Allan Porterfield, RENCI Kyle Wheeler, Sandia National Labs Jan Prins, UNC Chapel Hill Outline" Introduction

More information

A SURVEY ON MAPREDUCE IN CLOUD COMPUTING

A SURVEY ON MAPREDUCE IN CLOUD COMPUTING A SURVEY ON MAPREDUCE IN CLOUD COMPUTING Dr.M.Newlin Rajkumar 1, S.Balachandar 2, Dr.V.Venkatesakumar 3, T.Mahadevan 4 1 Asst. Prof, Dept. of CSE,Anna University Regional Centre, Coimbatore, newlin_rajkumar@yahoo.co.in

More information

The Methodology of Application Development for Hybrid Architectures

The Methodology of Application Development for Hybrid Architectures Computer Technology and Application 4 (2013) 543-547 D DAVID PUBLISHING The Methodology of Application Development for Hybrid Architectures Vladimir Orekhov, Alexander Bogdanov and Vladimir Gaiduchok Department

More information

Chapter 2 Parallel Computer Architecture

Chapter 2 Parallel Computer Architecture Chapter 2 Parallel Computer Architecture The possibility for a parallel execution of computations strongly depends on the architecture of the execution platform. This chapter gives an overview of the general

More information

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General

More information

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Welcome! Who am I? William (Bill) Gropp Professor of Computer Science One of the Creators of

More information

The Key Technology Research of Virtual Laboratory based On Cloud Computing Ling Zhang

The Key Technology Research of Virtual Laboratory based On Cloud Computing Ling Zhang International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2015) The Key Technology Research of Virtual Laboratory based On Cloud Computing Ling Zhang Nanjing Communications

More information

Dynamic Adaptive Feedback of Load Balancing Strategy

Dynamic Adaptive Feedback of Load Balancing Strategy Journal of Information & Computational Science 8: 10 (2011) 1901 1908 Available at http://www.joics.com Dynamic Adaptive Feedback of Load Balancing Strategy Hongbin Wang a,b, Zhiyi Fang a,, Shuang Cui

More information