OpenMP in Multicore Architectures
Venkatesan Packirisamy, Harish Barathvajasankar

Abstract

OpenMP is an API (application program interface) used to explicitly direct multi-threaded, shared-memory parallelism. With the advent of multi-core processors, there has been renewed interest in parallelizing programs. Multi-core offers support for executing threads in parallel while keeping the cost of communication very low. This opens up new domains for extracting parallelism. The aim of our project is to study how OpenMP could be used in multi-core processors. We investigated this in two parts - extracting fine-grained parallelism and extracting speculative parallelism. The report concentrates mainly on the second part - speculative parallelism - though we offer some insights into fine-grained parallelism. Our results show that specifying hints through OpenMP directives is important for exposing parallelism. We also propose a hardware technique that could improve the performance of thread-level speculation.

1 Introduction

OpenMP is an API used to explicitly direct multi-threaded, shared-memory parallelism. OpenMP was introduced in 1997 to standardize programming extensions for shared-memory machines, and it has been widely used since then for parallelizing scientific programs. In OpenMP the user specifies the regions of the code that are parallel. The user also specifies the necessary synchronization - locks, barriers, etc. - to ensure correct execution of the parallel region. At runtime, threads are forked for the parallel region and are typically executed on different processors sharing the same memory and address space. Recently many chip manufacturers have announced multi-core processors [1][2]. In a multi-core processor, each die has several processing elements that typically share a common L2 cache. The whole die acts as a traditional multiprocessor system, except that the cores now share the L2 cache.
The advantage now is that we can execute different threads on these processing elements, and the communication cost between the threads is very low since they share the L2 cache. Another advantage of having multiple cores is that we can use them to extract thread-level parallelism in a program and hence increase the performance of a single program. A lot of research has been done in this area: many techniques rely on hardware-based mechanisms [3][4] and some depend on the compiler to extract the threads [5][6].
So we have a system that can be used as a traditional multiprocessor with minimal communication cost and that can also be used to improve single-threaded programs. OpenMP has so far been used only in traditional multiprocessor environments, and since multi-core processors are very similar to traditional multiprocessors, it is natural to consider OpenMP on multi-core processors. Because a multi-core processor can be used either as a traditional multiprocessor or to implement thread-level speculation (TLS), OpenMP can likewise be used in both ways:

Fine-grained parallelism. Because of the minimal communication cost in multi-core processors, we could parallelize loops that suffered from high communication cost in traditional multiprocessors. We could parallelize inner loops that were previously not parallelizable due to their communication cost and size, and OpenMP can be extended with the additional directives needed to support this fine-grained parallelism.

Speculative parallelism. Multi-core processors can be used to improve single-thread performance through thread-level speculation (TLS). OpenMP can serve as a way for the user to specify hints to the compiler and hardware, exposing the parallelism in the speculative threads.

The rest of the paper is organized as follows: in section 2 we discuss related work. In section 3 we discuss the potential for using OpenMP to extract fine-grained parallelism. In section 4 we describe the simulation environment used in the study of speculative parallelism. In section 5 we discuss how OpenMP could be used for speculative parallelization. In section 6 we present the parallelization framework we used to parallelize the loops. In section 7 we compare our parallelized loops with the loops parallelized by the compiler [5]. In section 8 we discuss possible hardware optimizations.
In section 9 we conclude, and in section 10 we discuss future work.

2 Related Work

Using OpenMP to extract multi-level parallelism is studied in [7], where OpenMP directives were proposed to express multi-level parallelism. We would want a similar technique for multi-core environments, where, say, the outer loop is parallelized across processors and the inner loop is parallelized across the processing elements inside each processor. [8] tried to apply speculative techniques to improve OpenMP programs: there, threads do not always wait at synchronization points (depending on the OpenMP hint), so violations can occur; these are detected and the offending thread is squashed. In our report we use OpenMP to improve the thread-level parallelism of integer programs that are not inherently parallel.
Thread-level speculation has been widely studied as a technique for extracting thread-level parallelism. The threads are extracted either in hardware [9][10][11][15] or in software [12][13][16] and are executed in parallel. The threads are not always independent, so if a dependence violation occurs, it is typically detected in hardware and the offending thread is squashed and restarted. In [6] the compiler is used to schedule the instructions in the threads so as to reduce the impact of inter-thread data dependences; the code generated using that technique is what we compare our generated code against. In [5] pipelined execution of threads is studied. The parallelization framework we use is very similar to thread pipelining, except that our method is more general and more easily applicable to different loops; also, in our case all compiler optimizations [6] remain applicable. In [11][12] hardware-based synchronization mechanisms have been studied to improve the performance of synchronization operations. In this report we propose hardware-based update mechanisms that are very similar in principle to those mechanisms. [10] used manual parallelization to improve the performance of speculative multithreading. Our report does not modify the basic algorithm design in the code; it only exposes parallelism to the compiler and hardware by providing hints.

3 Fine-grained parallelism

As the first possibility for using OpenMP on multi-cores, we tried to extract parallelism from the inner loops of the SpecOMP benchmarks. But we were not able to use the OpenMP compiler [13] to compile the SpecOMP benchmarks, so we could not proceed very far in this direction. Instead we studied the possible parallelism in the SpecOMP benchmarks by manually inspecting the code. In the benchmark equake, we found that OpenMP had been used to parallelize 14 of the major loops; there were some inner loops, and also some loops that had not been parallelized by OpenMP.
But when we looked at the loops more closely, we found that the loops that were not parallelized have small iteration counts, and the inner loops have very small bodies. We were not able to analyze the FORTRAN-based benchmarks because of our limited familiarity with FORTRAN. Due to the apparent lack of potential and the difficulty of understanding and compiling the SpecOMP code, we abandoned this part and focused on the other part - speculative parallelism. The rest of the paper concentrates on the main focus: speculative parallelization of integer programs using OpenMP directives. First we describe the simulation methodology and the basic multi-core architecture used for this project.
4 Simulation Methodology

For our study we used a SimpleScalar-based chip multiprocessor (CMP) simulator. The basic configuration of the simulator is given in Figure 1.

Figure 1: Basic Configuration

4.1 Basic Architecture

The simulated CMP architecture has 4 processing elements, each with the basic configuration stated above. They have private L1 caches and a shared L2 cache. There is also a signal table for each processing element, used for synchronization to avoid squashing. Each processing element has a store buffer of 2K entries and a load-address buffer of 2K entries. When a speculative thread executes on a processor, store values are put into the store buffer. When a load is issued, the address of the load is put into the load-address buffer. For each store in a thread, the subsequent threads' address buffers are checked for a dependence violation. If a violation is detected, the violating thread and all subsequent threads are squashed, and execution of the first violating thread is restarted.

4.2 Tool used

One of the major parts of our project is analyzing the different benchmark programs to find scope for the user to specify hints. We used a SimpleScalar-based tool to study the loops, identifying frequently occurring data dependences between threads. Though other tools based on dependence profiling and SCCs exist, we implemented our own tool because the others do not report which PC addresses actually cause the dependence. We need this information because we analyze the benchmarks manually and must find which statements cause the dependence violations. To use the tool, we simply mark the loop we want to study, and the tool prints the PC addresses of store-load dependence pairs.
Figure 2: CMP Architecture

4.3 Benchmarks studied

We aimed to study 6 of the Spec2000 integer benchmarks. We analyzed all 6 (Figure 3), but we present results for only 3 of them. For the remaining benchmarks we offer insights into the behavior of their loops and how they could be parallelized.

Figure 3: Benchmark Description

4.4 Simulation methodology

Since we had to run the simulation many times until we obtained good performance, we worked only with the test input sets. For all three benchmarks, results are presented only for the test input sets.

5 Speculative parallelization

In this section we use directives given by the user to improve the hardware and compiler mechanisms for thread-level speculation. First we see what the possibilities are. Then in section 6 we present a
framework for exposing parallelism, compare it with related work, and discuss how the different benchmarks we studied fit this framework.

5.1 OpenMP directives

Many hardware- and compiler-based techniques have been proposed to improve thread-level speculation. Compiler techniques aim to reduce the complexity of the hardware by moving some of that complexity into the compiler. Also, the compiler, with its global view, is better placed to extract thread-level speculation. Even the hardware-based techniques usually assume some form of compiler support for selecting the region or specifying synchronization. But the compiler lacks run-time information. One way to solve this problem is to let the user specify hints that help both the compiler and the architecture. The possible directives from the user can be broadly classified into three categories:

Identifying regions to parallelize. One of the major decisions for the compiler or hardware is which loop or region to parallelize. This is a hard decision, usually made by considering the execution time of the loop and its dependence properties. A user (or an optimizer) is in a better position to make it: the user can simply identify the loop and communicate it through a directive. This is very similar to the OpenMP parallel directive, except that here the loop is only speculatively parallel.

Changing the loop body. After identifying the loop, the hardware can execute the threads on different processors simultaneously. But this may cause frequent squashes due to dependence violations, so the hardware could selectively synchronize on some dependences. Usually this requires compiler support, and the compiler can also schedule the code to reduce the impact of data dependences. Here too the user is in a better position to optimize the code.
The user could indicate, for each load, whether a dependence violation is likely, or could rearrange the loop to reduce the violations. This is the part we address in our project.

Other specific information. Beyond the two major categories above, the hardware may still need specific information to execute the threads efficiently. For example, the user could specify the typical iteration count of the loop, from which the compiler can make important decisions, or could provide load-balancing information that helps the hardware distribute the work efficiently. We do not consider this part in our project.
6 Framework for parallelization

In this section we discuss the typical behavior of the benchmark loops. From this behavior we derive a framework for manual parallelization. By framework, we mean a specific code layout - how the code should be arranged so that it exposes the parallelism in the loop to the hardware or compiler.

6.1 Method of analysis

To analyze a benchmark program, we first obtain profile information and find the loops with high coverage. Then we examine the data-dependence profile; if the number of dependences is very large, the loop is not considered further. Next, the SimpleScalar-based tool is used to find the actual statements in the source code that cause the mis-speculations, and the source code is analyzed to see why exactly each dependence occurs. From our analysis, we can classify the benchmark loops into two types:

Possibly parallel loops. In most loops, the data needed to start the next iteration does not depend on the main processing done in the current iteration. For example, in the inner loop of the function refresh_potential in mcf, the induction variable is the only data needed to start the next iteration, and it is independent of the processing in the current iteration.

Serial loops. In many other loops, the next iteration depends heavily on the processing in the current iteration. For example, the loop in the sort_basket() function of mcf has very good coverage, but it is very serial - it is a quicksort, and the next iteration is started based on values computed in the current one.

6.2 Ideal parallel execution

We saw that in some cases the only thing needed to start a thread is the induction variable, so ideally the next iteration should start as soon as its data has been generated.
Now we can start the next iteration immediately after generating its data. But when we examined the loops, we found an interesting trend.

6.3 Usual model of execution

Apart from generating the values for the next iteration, the loops usually update some global variables: the results of the processing done in the current iteration are written to global variables or arrays. Updating values at the end of the iteration is very common across benchmarks, and unlike the induction-variable updates, these writes cannot be moved to the top of the iteration because they need the results of the entire iteration. For example, consider the code at the end of an iteration of a loop in new_dbox_a of twolf:

    } else {
        m = 0 ;
    }
    tmp_missing_rows[net] = -m ;
}
delta_vert_cost += ((tmp_num_feeds[net] - num_feeds[net]) +
                    (tmp_missing_rows[net] - missing_rows[net])) * 2 * rowheight ;

At the end of the iteration, the global variable delta_vert_cost is updated with values calculated in the current iteration. Let us see how current TLS mechanisms handle such situations:
Speculation. One way is to execute the code speculatively. The update usually occurs towards the end of the iteration, so if the non-speculative thread completes first, no dependence violation occurs. This does not always work: if the speculative thread runs faster, it gets squashed.

Synchronization. We could synchronize on this dependence. This avoids the squash but can serialize the code. In our example that is acceptable, but these kinds of reductions also occur in the middle of the iteration, e.g. in the inner loop of twolf's new_dbox().

Recovery code. Recovery code could help, but there are limits to where it can be applied. Consider the loop in twolf's new_dbox():

    }
    *costptr += ABS( newx - new_mean ) - ABS( oldx - old_mean ) ;
}

Here the reduction occurs in the inner loop, so to recover we would have to re-execute the entire inner loop.

6.4 Our Technique

Among these techniques, synchronization seems to be the best choice, but performance suffers if the update occurs in the middle of the loop. To reduce this impact, we privatize the update, accumulating the results in a temporary variable; at the end of the iteration, the temporary is added to the global structure. In our technique we try to move the global update down, and if it occurs in the middle we privatize the local update. In our basic technique the global update is then guarded by a wait instruction. We also identify the induction variable so that it does not cause unnecessary synchronization or speculation. Based on this analysis, we believe a manually parallelized loop should have a three-phase structure: induction-variable generation, main processing, and global update. In our basic technique we synchronize before entering the update phase. Different hardware-based optimizations are possible; they are discussed in section 8.
From our initial analysis, we tabulated for mcf, twolf, and vpr the execution-time coverage of the loops that could fit this framework structure. Some of the loops were later found to be either too large to parallelize or to have high-frequency inter-thread data dependences.

6.5 OpenMP hints

The user has to specify hints that identify these phases in the loop. The user can rearrange the code, privatize the variables, and perform other optimizations himself; the compiler can do the same, provided the basic hints identifying the phases are given.

6.6 Behavior of benchmark programs

In this section we look at how the different benchmark programs we analyzed fit into the framework of the previous section, and we suggest techniques that could be studied further for each benchmark.

a) twolf. Two high-coverage loops fit the framework exactly; without the user directives it is very difficult to parallelize them. Some of the loops suffer from small size. Unrolling could help these loops, but unrolling can disrupt the loop structure and cause mis-speculations; the hardware technique discussed later could be used for such loops.

b) mcf. Mcf also has many loops that satisfy the code framework. Here too the loops suffer from small size; they could be unrolled, but that violates the framework.

c) vpr. Some loops satisfy the global-update property, but they are enclosed by an outer loop which is also parallel, and even a simple compiler technique can identify that parallelism. So we do not gain much from our technique.

d) gzip and bzip2. There are some parallel loops, but they all have very small bodies; they could perform well if unrolled. We did not find any major loops that satisfy our framework.
e) vortex. Vortex loops contain many function calls, often with complicated nesting, which made analyzing the loops very hard. Using our tool we found that many dependences occur through a status variable that is updated almost everywhere. But we believe the value read is always 0 unless an error occurred, so the hardware could be modified to squash the thread only when the value read has actually changed.
6.7 Comparison with other techniques

In our technique there are three phases of execution - induction-variable calculation, main processing, and global update. This is very similar to the thread-pipelining strategy of [5]: there the first two stages are the same, but our update stage is replaced by a write-back stage. In [5], the aim is always to produce all the information needed by other threads as soon as possible; in our case we move the updates downward. In [10], manual parallelization was discussed, but the focus there was on changing the code completely to increase parallelism. In our technique the user only exposes the available parallelism; parallelism is never created.

7 Results

7.1 Results of basic technique

The aim of passing user hints is to overcome the limitations of the compiler and hardware, so we compare our technique with the compiler-based scheduling technique [6], where the code is scheduled to reduce the impact of data dependences. We were able to get the code generated by the scheduling technique from Prof. Antonia Zhai, who had implemented it in the SUIF compiler. But that code had directives specific to their environment, and its overhead was very large, so we replicated the effect of their code in the original benchmark code: signal and wait instructions were inserted at exactly the same points, the loops were unrolled exactly the same number of times, and all other modifications were made to match their code completely. Then we applied our technique on top of that code. Sometimes the same loops could be scheduled better using our technique; at other times, more loops could be parallelized that had not been parallelized due to some limitation. Figure 4 compares the scheduling-based technique with our framework-based technique; the bars indicate the percentage decrease in execution time after applying each technique.
We see that the performance is sometimes worse than single-threaded execution. One of the main reasons is instruction overhead; Figure 5 shows the percentage increase in the number of instructions executed after applying the technique.

7.2 Analysis of results

Here we analyze the performance of each benchmark.

1. mcf. Though there are parallel loops, their bodies are very small. When we unrolled the loops, they suffered mis-speculations, so the two loops pulled back the performance. In the scheduler-based code, the second loop was synchronized. This led to serial execution, and at the same
Figure 4: Performance Impact of the technique

Figure 5: Increase in the number of instructions executed

time there were gaps in the pipeline and there was instruction overhead. Due to these effects there was a large performance decrease.

2. twolf. Our technique showed more than 13% performance improvement. The scheduler-based technique could not effectively parallelize two major loops; with our technique we could cover many more loops.

3. vpr. The results are exactly the same. Though we were able to cover many loops not covered by the scheduler technique, there is one outer loop that includes many of those inner loops, and the performance is dominated by that single loop.
8 Hardware-based improvements

In this section we discuss some improvements that can be made over the basic technique.

8.1 Commit overlap

In a typical speculative multithreading processor, the commit of one thread starts only after the commit of the previous thread. In the update phase of a thread's execution, the thread only updates the global variables, so it cannot cause any additional mis-speculations. The next thread can therefore start committing instructions once the previous thread has reached its update phase, allowing commit time to be overlapped to some extent. In [14], commit time was identified as one of the major bottlenecks, and different techniques were studied.

8.2 Update in hardware

In [11][12], hardware-based synchronization was proposed to improve the performance of some common types of update operations. The update phase of our threads is also a kind of synchronization, so we could use hardware support to improve the performance of these operations. As we saw in the results, some of the benchmarks suffer from small code size. But if the code size is increased by unrolling the loops, the structure of the loop is affected and it no longer has the phase-based structure: there would now be two or more update sections distributed over the entire iteration. This can be a severe bottleneck, since each section could cause synchronization (in the basic technique, synchronization guards the update phase). We propose a hardware buffer to hold the update instructions and their related load instructions. Instead of executing, the update instructions are sent to this buffer; when the previous thread signals the start of the update phase, the buffered instructions are executed. We thus maintain the order of the updates while avoiding synchronization in the middle of the iteration.

9 Conclusion

In this report we studied how OpenMP hints are applicable to multi-core architectures.
OpenMP can be applied to extract fine-grained parallelism and also to extract speculative parallelism. In this report we concentrated on the latter - how the user can expose parallelism so that the compiler and hardware are more effective. We identified a typical structure for parallelizable loops; the user specifies the loop in this structure, and the compiler or the user can schedule the code accordingly. When we compared code generated in this way with code generated by a typical scheduling-based technique, we found that
our technique showed much better performance. We also proposed hardware techniques that can further improve the performance.

10 Future Work

We need to analyze the remaining benchmarks; it is not clear whether the framework we present is applicable to programs like gcc, parser, etc. We also have to complete the hardware-based techniques and study their performance improvement. By analyzing more benchmarks we can get a deeper understanding of their behavior and any apparent patterns. In our analysis we found that load balancing could be an important factor in some benchmarks: some iterations execute faster than others due to different control-flow paths (e.g. in vpr).

References

[1] J. Kahle. Power4: A Dual-CPU Processor Chip. Microprocessor Forum '99, October 1999.
[2] M. Tremblay. MAJC: Microprocessor Architecture for Java Computing. Hot Chips 99, August 1999.
[3] G. S. Sohi, S. Breach, and T. Vijaykumar. Multiscalar Processors. In Proceedings of the 22nd ISCA, June 1995.
[4] P. Marcuello and A. González. Clustered Speculative Multithreaded Processors. In Proc. of the ACM Int. Conf. on Supercomputing, June 1999.
[5] J.-Y. Tsai, J. Huang, C. Amlo, D. Lilja, and P.-C. Yew. The Superthreaded Processor Architecture. IEEE Transactions on Computers, Special Issue on Multithreaded Architectures, 48(9), September 1999.
[6] A. Zhai, C. B. Colohan, J. G. Steffan, and T. C. Mowry. Compiler Optimization of Scalar Value Communication Between Speculative Threads. The Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), San Jose, CA, USA, Oct 7-9.
[7] E. Ayguadé, X. Martorell, J. Labarta, M. González, and N. Navarro. Exploiting Multiple Levels of Parallelism in OpenMP: A Case Study. Proceedings of the 1999 International Conference on Parallel Processing.
[8] J. F. Martínez and J. Torrellas. Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications. 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October.
[9] V. Krishnan and J. Torrellas. A Clustered Approach to Multithreaded Processors. International Parallel Processing Symposium (IPPS), March.
[10] M. Prabhu and K. Olukotun. Using Thread-Level Speculation to Simplify Manual Parallelization. Proceedings of the 2003 Principles and Practices of Parallel Programming, San Diego, CA, June.
[11] The Cedar System and an Initial Performance Study. Proceedings of the 20th Annual International Symposium on Computer Architecture.
[12] H.-M. Su and P.-C. Yew. On Data Synchronization for Multiprocessors. International Conference on Computer Architecture.
[13] Y. Chen, J. Li, S. Wang, and D. Wang. ORC-OpenMP: An OpenMP Compiler Based on ORC. Tsinghua University, China.
[14] M. Prvulovic, M. J. Garzarán, L. Rauchwerger, and J. Torrellas. Removing Architectural Bottlenecks to the Scalability of Speculative Parallelization. 28th Annual International Symposium on Computer Architecture (ISCA), June.
[15] L. Hammond, B. Hubbert, M. Siu, M. Prabhu, M. Chen, and K. Olukotun. The Stanford Hydra CMP. IEEE Micro, March-April.
[16] A. Zhai, C. B. Colohan, J. G. Steffan, and T. C. Mowry. Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads. The 2nd International Symposium on Code Generation and Optimization (CGO-2004), Palo Alto, CA, USA, March 20-24.
More informationBSC vision on Big Data and extreme scale computing
BSC vision on Big Data and extreme scale computing Jesus Labarta, Eduard Ayguade,, Fabrizio Gagliardi, Rosa M. Badia, Toni Cortes, Jordi Torres, Adrian Cristal, Osman Unsal, David Carrera, Yolanda Becerra,
More informationCategories and Subject Descriptors C.1.1 [Processor Architecture]: Single Data Stream Architectures. General Terms Performance, Design.
Enhancing Memory Level Parallelism via Recovery-Free Value Prediction Huiyang Zhou Thomas M. Conte Department of Electrical and Computer Engineering North Carolina State University 1-919-513-2014 {hzhou,
More informationLimits of Thread-Level Parallelism in Non-numerical Programs
Vol. 47 No. SIG 7(ACS 14) IPSJ Transactions on Advanced Computing Systems May 2006 Regular Paper Limits of Thread-Level Parallelism in Non-numerical Programs Akio Nakajima,, Ryotaro Kobayashi, Hideki Ando
More informationParallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
More informationCUDA programming on NVIDIA GPUs
p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view
More informationUnderstanding Hardware Transactional Memory
Understanding Hardware Transactional Memory Gil Tene, CTO & co-founder, Azul Systems @giltene 2015 Azul Systems, Inc. Agenda Brief introduction What is Hardware Transactional Memory (HTM)? Cache coherence
More informationThe SpiceC Parallel Programming System of Computer Systems
UNIVERSITY OF CALIFORNIA RIVERSIDE The SpiceC Parallel Programming System A Dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science
More informationLecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
More informationReconfigurable Architecture Requirements for Co-Designed Virtual Machines
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra
More informationSEER PROBABILISTIC SCHEDULING FOR COMMODITY HARDWARE TRANSACTIONAL MEMORY. 27 th Symposium on Parallel Architectures and Algorithms
27 th Symposium on Parallel Architectures and Algorithms SEER PROBABILISTIC SCHEDULING FOR COMMODITY HARDWARE TRANSACTIONAL MEMORY Nuno Diegues, Paolo Romano and Stoyan Garbatov Seer: Scheduling for Commodity
More informationBLM 413E - Parallel Programming Lecture 3
BLM 413E - Parallel Programming Lecture 3 FSMVU Bilgisayar Mühendisliği Öğr. Gör. Musa AYDIN 14.10.2015 2015-2016 M.A. 1 Parallel Programming Models Parallel Programming Models Overview There are several
More informationCarlos Villavieja, Nacho Navarro {cvillavi,nacho}@ac.upc.edu. Arati Baliga, Liviu Iftode {aratib,liviu}@cs.rutgers.edu
Continuous Monitoring using MultiCores Carlos Villavieja, Nacho Navarro {cvillavi,nacho}@ac.upc.edu Arati Baliga, Liviu Iftode {aratib,liviu}@cs.rutgers.edu Motivation Intrusion detection Intruder gets
More informationIntroduction to DISC and Hadoop
Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and
More informationReducing Memory in Software-Based Thread-Level Speculation for JavaScript Virtual Machine Execution of Web Applications
Reducing Memory in Software-Based Thread-Level Speculation for JavaScript Virtual Machine Execution of Web Applications Abstract Thread-Level Speculation has been used to take advantage of multicore processors
More informationTHE VELOCITY COMPILER: EXTRACTING EFFICIENT MULTICORE EXECUTION FROM LEGACY SEQUENTIAL CODES MATTHEW JOHN BRIDGES A DISSERTATION
THE VELOCITY COMPILER: EXTRACTING EFFICIENT MULTICORE EXECUTION FROM LEGACY SEQUENTIAL CODES MATTHEW JOHN BRIDGES A DISSERTATION PRESENTED TO THE FACULTY OF PRINCETON UNIVERSITY IN CANDIDACY FOR THE DEGREE
More informationHPC Wales Skills Academy Course Catalogue 2015
HPC Wales Skills Academy Course Catalogue 2015 Overview The HPC Wales Skills Academy provides a variety of courses and workshops aimed at building skills in High Performance Computing (HPC). Our courses
More informationEE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000. ILP Execution
EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000 Lecture #11: Wednesday, 3 May 2000 Lecturer: Ben Serebrin Scribe: Dean Liu ILP Execution
More informationUsing Predictive Adaptive Parallelism to Address Portability and Irregularity
Using Predictive Adaptive Parallelism to Address Portability and Irregularity avid L. Wangerin and Isaac. Scherson {dwangeri,isaac}@uci.edu School of Computer Science University of California, Irvine Irvine,
More informationDACOTA: Post-silicon Validation of the Memory Subsystem in Multi-core Designs. Presenter: Bo Zhang Yulin Shi
DACOTA: Post-silicon Validation of the Memory Subsystem in Multi-core Designs Presenter: Bo Zhang Yulin Shi Outline Motivation & Goal Solution - DACOTA overview Technical Insights Experimental Evaluation
More informationOn-line Trace Based Automatic Parallelization of Java Programs on Multicore Platforms
On-line Trace Based Automatic Parallelization of Java Programs on Multicore Platforms Yu Sun and Wei Zhang Department of ECE, Virginia Commonwealth University wzhang4@vcu.edu Abstract We propose a new
More informationStream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
More informationThe Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud.
White Paper 021313-3 Page 1 : A Software Framework for Parallel Programming* The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud. ABSTRACT Programming for Multicore,
More informationOverview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
More informationMulticore Parallel Computing with OpenMP
Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large
More informationOptimizing Shared Resource Contention in HPC Clusters
Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs
More informationDesign and Implementation of the Heterogeneous Multikernel Operating System
223 Design and Implementation of the Heterogeneous Multikernel Operating System Yauhen KLIMIANKOU Department of Computer Systems and Networks, Belarusian State University of Informatics and Radioelectronics,
More informationImproving System Scalability of OpenMP Applications Using Large Page Support
Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support Ranjit Noronha and Dhabaleswar K. Panda Network Based Computing Laboratory (NBCL) The Ohio State University Outline
More informationGPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
More informationCHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.
More informationCode Coverage Testing Using Hardware Performance Monitoring Support
Code Coverage Testing Using Hardware Performance Monitoring Support Alex Shye Matthew Iyer Vijay Janapa Reddi Daniel A. Connors Department of Electrical and Computer Engineering University of Colorado
More informationMapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu
1 MapReduce on GPUs Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 2 MapReduce MAP Shuffle Reduce 3 Hadoop Open-source MapReduce framework from Apache, written in Java Used by Yahoo!, Facebook, Ebay,
More informationPerformance Impacts of Non-blocking Caches in Out-of-order Processors
Performance Impacts of Non-blocking Caches in Out-of-order Processors Sheng Li; Ke Chen; Jay B. Brockman; Norman P. Jouppi HP Laboratories HPL-2011-65 Keyword(s): Non-blocking cache; MSHR; Out-of-order
More informationSolution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:
Multiple-Issue Processors Pipelining can achieve CPI close to 1 Mechanisms for handling hazards Static or dynamic scheduling Static or dynamic branch handling Increase in transistor counts (Moore s Law):
More informationComparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster Gabriele Jost and Haoqiang Jin NAS Division, NASA Ames Research Center, Moffett Field, CA 94035-1000 {gjost,hjin}@nas.nasa.gov
More informationCellular Computing on a Linux Cluster
Cellular Computing on a Linux Cluster Alexei Agueev, Bernd Däne, Wolfgang Fengler TU Ilmenau, Department of Computer Architecture Topics 1. Cellular Computing 2. The Experiment 3. Experimental Results
More informationInstruction Set Architecture (ISA) Design. Classification Categories
Instruction Set Architecture (ISA) Design Overview» Classify Instruction set architectures» Look at how applications use ISAs» Examine a modern RISC ISA (DLX)» Measurement of ISA usage in real computers
More informationSoftware and the Concurrency Revolution
Software and the Concurrency Revolution A: The world s fastest supercomputer, with up to 4 processors, 128MB RAM, 942 MFLOPS (peak). 2 Q: What is a 1984 Cray X-MP? (Or a fractional 2005 vintage Xbox )
More informationParallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis
More informationEffective Utilization of Multicore Processor for Unified Threat Management Functions
Journal of Computer Science 8 (1): 68-75, 2012 ISSN 1549-3636 2012 Science Publications Effective Utilization of Multicore Processor for Unified Threat Management Functions Sudhakar Gummadi and Radhakrishnan
More informationTesting Database Performance with HelperCore on Multi-Core Processors
Project Report on Testing Database Performance with HelperCore on Multi-Core Processors Submitted by Mayuresh P. Kunjir M.E. (CSA) Mahesh R. Bale M.E. (CSA) Under Guidance of Dr. T. Matthew Jacob Problem
More informationQuiz for Chapter 1 Computer Abstractions and Technology 3.10
Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,
More informationAchieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
More informationDriving force. What future software needs. Potential research topics
Improving Software Robustness and Efficiency Driving force Processor core clock speed reach practical limit ~4GHz (power issue) Percentage of sustainable # of active transistors decrease; Increase in #
More informationTools Page 1 of 13 ON PROGRAM TRANSLATION. A priori, we have two translation mechanisms available:
Tools Page 1 of 13 ON PROGRAM TRANSLATION A priori, we have two translation mechanisms available: Interpretation Compilation On interpretation: Statements are translated one at a time and executed immediately.
More informationParallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier.
Parallel Computing 37 (2011) 26 41 Contents lists available at ScienceDirect Parallel Computing journal homepage: www.elsevier.com/locate/parco Architectural support for thread communications in multi-core
More informationReferences 149. [85] D.W. Wall, Limits of Instruction-Level Parallelism, Tech. Report WRL 93/6, Digital Western Research Laboratory,
REFERENCES [1] P. Ahuja, K. Skadron, M. Martonosi and D. Clark, Multipath Execution: Oportunities and Limits, in Proc of the 12th Int. Conf. on Supercomputing, pp. 101-108,1998. [2] H. Akkary and M.A.
More informationOverlapping Data Transfer With Application Execution on Clusters
Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer
More informationTHE NAS KERNEL BENCHMARK PROGRAM
THE NAS KERNEL BENCHMARK PROGRAM David H. Bailey and John T. Barton Numerical Aerodynamic Simulations Systems Division NASA Ames Research Center June 13, 1986 SUMMARY A benchmark test program that measures
More informationUsing Power to Improve C Programming Education
Using Power to Improve C Programming Education Jonas Skeppstedt Department of Computer Science Lund University Lund, Sweden jonas.skeppstedt@cs.lth.se jonasskeppstedt.net jonasskeppstedt.net jonas.skeppstedt@cs.lth.se
More informationMPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp
MPI and Hybrid Programming Models William Gropp www.cs.illinois.edu/~wgropp 2 What is a Hybrid Model? Combination of several parallel programming models in the same program May be mixed in the same source
More informationThread level parallelism
Thread level parallelism ILP is used in straight line code or loops Cache miss (off-chip cache and main memory) is unlikely to be hidden using ILP. Thread level parallelism is used instead. Thread: process
More informationData Structure Oriented Monitoring for OpenMP Programs
A Data Structure Oriented Monitoring Environment for Fortran OpenMP Programs Edmond Kereku, Tianchao Li, Michael Gerndt, and Josef Weidendorfer Institut für Informatik, Technische Universität München,
More informationReliable Systolic Computing through Redundancy
Reliable Systolic Computing through Redundancy Kunio Okuda 1, Siang Wun Song 1, and Marcos Tatsuo Yamamoto 1 Universidade de São Paulo, Brazil, {kunio,song,mty}@ime.usp.br, http://www.ime.usp.br/ song/
More informationSupporting OpenMP on Cell
Supporting OpenMP on Cell Kevin O Brien, Kathryn O Brien, Zehra Sura, Tong Chen and Tao Zhang IBM T. J Watson Research Abstract. The Cell processor is a heterogeneous multi-core processor with one Power
More informationPROBLEMS #20,R0,R1 #$3A,R2,R4
506 CHAPTER 8 PIPELINING (Corrisponde al cap. 11 - Introduzione al pipelining) PROBLEMS 8.1 Consider the following sequence of instructions Mul And #20,R0,R1 #3,R2,R3 #$3A,R2,R4 R0,R2,R5 In all instructions,
More informationCentralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures
Chapter 18: Database System Architectures Centralized Systems! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types! Run on a single computer system and do
More informationThe University of Arizona Department of Electrical and Computer Engineering Term Paper (and Presentation) for ECE 569 Fall 2006 21 February 2006
The University of Arizona Department of Electrical and Computer Engineering Term Paper (and Presentation) for ECE 569 Fall 2006 21 February 2006 The term project for this semester is an independent study
More informationMaking Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association
Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?
More informationAn Introduction to Parallel Computing/ Programming
An Introduction to Parallel Computing/ Programming Vicky Papadopoulou Lesta Astrophysics and High Performance Computing Research Group (http://ahpc.euc.ac.cy) Dep. of Computer Science and Engineering European
More informationA Pattern-Based Approach to. Automated Application Performance Analysis
A Pattern-Based Approach to Automated Application Performance Analysis Nikhil Bhatia, Shirley Moore, Felix Wolf, and Jack Dongarra Innovative Computing Laboratory University of Tennessee (bhatia, shirley,
More informationHistorically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately.
Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately. Hardware Solution Evolution of Computer Architectures Micro-Scopic View Clock Rate Limits Have Been Reached
More informationMaximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
More informationAn Event-Driven Multithreaded Dynamic Optimization Framework
In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), Sept. 2005. An Event-Driven Multithreaded Dynamic Optimization Framework Weifeng Zhang Brad Calder
More informationUnderstanding the Impact of Inter-Thread Cache Interference on ILP in Modern SMT Processors
Journal of Instruction-Level Parallelism 7 (25) 1-28 Submitted 2/25; published 6/25 Understanding the Impact of Inter-Thread Cache Interference on ILP in Modern SMT Processors Joshua Kihm Alex Settle Andrew
More informationFPGA area allocation for parallel C applications
1 FPGA area allocation for parallel C applications Vlad-Mihai Sima, Elena Moscu Panainte, Koen Bertels Computer Engineering Faculty of Electrical Engineering, Mathematics and Computer Science Delft University
More informationA graph-oriented task manager for small multiprocessor systems.
A graph-oriented task manager for small multiprocessor systems. Xavier Verians 1, Jean-Didier Legat 1, Jean-Jacques Quisquater 1 and Benoit Macq 2 1 Microelectronics Laboratory, Université Catholique de
More informationLecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com
CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Modern GPU
More informationDATABASE CONCURRENCY CONTROL USING TRANSACTIONAL MEMORY : PERFORMANCE EVALUATION
DATABASE CONCURRENCY CONTROL USING TRANSACTIONAL MEMORY : PERFORMANCE EVALUATION Jeong Seung Yu a, Woon Hak Kang b, Hwan Soo Han c and Sang Won Lee d School of Info. & Comm. Engr. Sungkyunkwan University
More informationAn Adaptive Task-Core Ratio Load Balancing Strategy for Multi-core Processors
An Adaptive Task-Core Ratio Load Balancing Strategy for Multi-core Processors Ian K. T. Tan, Member, IACSIT, Chai Ian, and Poo Kuan Hoong Abstract With the proliferation of multi-core processors in servers,
More informationPerformance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
More informationThis Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings
This Unit: Multithreading (MT) CIS 501 Computer Architecture Unit 10: Hardware Multithreading Application OS Compiler Firmware CU I/O Memory Digital Circuits Gates & Transistors Why multithreading (MT)?
More informationInside the Erlang VM
Rev A Inside the Erlang VM with focus on SMP Prepared by Kenneth Lundin, Ericsson AB Presentation held at Erlang User Conference, Stockholm, November 13, 2008 1 Introduction The history of support for
More informationLoad-Balanced Pipeline Parallelism
Load-Balanced Pipeline Parallelism Md Kamruzzaman Steven Swanson Dean M. Tullsen Computer Science and Engineering University of California, San Diego {mkamruzz,swanson,tullsen@cs.ucsd.edu ABSTRACT Accelerating
More informationVisualization Enables the Programmer to Reduce Cache Misses
Visualization Enables the Programmer to Reduce Cache Misses Kristof Beyls and Erik H. D Hollander and Yijun Yu Electronics and Information Systems Ghent University Sint-Pietersnieuwstraat 41, Gent, Belgium
More informationA Systematic Approach to Model-Guided Empirical Search for Memory Hierarchy Optimization
A Systematic Approach to Model-Guided Empirical Search for Memory Hierarchy Optimization Chun Chen, Jacqueline Chame, Mary Hall, and Kristina Lerman University of Southern California/Information Sciences
More informationpicojava TM : A Hardware Implementation of the Java Virtual Machine
picojava TM : A Hardware Implementation of the Java Virtual Machine Marc Tremblay and Michael O Connor Sun Microelectronics Slide 1 The Java picojava Synergy Java s origins lie in improving the consumer
More informationCenter for Programming Models for Scalable Parallel Computing
Overall Project Title: Coordinating PI: Subproject Title: PI: Reporting Period: Center for Programming Models for Scalable Parallel Computing Rusty Lusk, ANL Future Programming Models Guang R. Gao Final
More informationScheduling Task Parallelism" on Multi-Socket Multicore Systems"
Scheduling Task Parallelism" on Multi-Socket Multicore Systems" Stephen Olivier, UNC Chapel Hill Allan Porterfield, RENCI Kyle Wheeler, Sandia National Labs Jan Prins, UNC Chapel Hill Outline" Introduction
More informationA SURVEY ON MAPREDUCE IN CLOUD COMPUTING
A SURVEY ON MAPREDUCE IN CLOUD COMPUTING Dr.M.Newlin Rajkumar 1, S.Balachandar 2, Dr.V.Venkatesakumar 3, T.Mahadevan 4 1 Asst. Prof, Dept. of CSE,Anna University Regional Centre, Coimbatore, newlin_rajkumar@yahoo.co.in
More informationThe Methodology of Application Development for Hybrid Architectures
Computer Technology and Application 4 (2013) 543-547 D DAVID PUBLISHING The Methodology of Application Development for Hybrid Architectures Vladimir Orekhov, Alexander Bogdanov and Vladimir Gaiduchok Department
More informationChapter 2 Parallel Computer Architecture
Chapter 2 Parallel Computer Architecture The possibility for a parallel execution of computations strongly depends on the architecture of the execution platform. This chapter gives an overview of the general
More informationPART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design
PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General
More informationDesigning and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp
Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Welcome! Who am I? William (Bill) Gropp Professor of Computer Science One of the Creators of
More informationThe Key Technology Research of Virtual Laboratory based On Cloud Computing Ling Zhang
International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2015) The Key Technology Research of Virtual Laboratory based On Cloud Computing Ling Zhang Nanjing Communications
More informationDynamic Adaptive Feedback of Load Balancing Strategy
Journal of Information & Computational Science 8: 10 (2011) 1901 1908 Available at http://www.joics.com Dynamic Adaptive Feedback of Load Balancing Strategy Hongbin Wang a,b, Zhiyi Fang a,, Shuang Cui
More information