Scalability and Programmability in the Manycore Era

Date: 2009/01/20

A draft synopsis for an EU FP7 STREP proposal

Mats Brorsson
KTH Information and Communication Technology / Swedish Institute of Computer Science (SICS)
matsbror@kth.se

Main Goals

We propose to address a grand challenge for the European software-intensive systems industry: to leverage multicore processors in the development of competitive products, and to take full advantage of the predicted technology evolution in a strategic perspective, from today's 4-64 cores, to 100s in five years, and 1000s in ten years (manycore). To leverage multicore and manycore, all software must be parallel. To be future-proof, the parallelism must be scalable. To be competitive, parallel programming must be performed with high software quality and productivity. The project proposal is written in the context of three years, with explicit milestones for each year.

Parallel programming presents three fundamental problems to developers: (1) parallelism: subdividing computations into units of work that can be performed simultaneously by different processor cores; (2) scheduling: assigning units of work to specific cores; and (3) safety: ensuring that parallel units of work are properly coordinated and free from incorrect interactions. Current approaches to these problems are characterized by one or more of the following limitations:

- The computation is statically subdivided into few and coarse-grained units of work, e.g., threads, which do not scale to more cores.
- The subdivision works for certain classes of software, e.g., loop parallelism in scientific computations, and not for general software.
- The subdivision affects overall program structure, making it difficult and costly to apply to legacy software.
- The subdivision introduces a complex and nondeterministic control flow, exacerbating problems of debugging and performance management.
- Scheduling is performed at a coarse-grained level of work, e.g., OS processes and threads, and does not scale.
- Scheduling does not consider locality of memory and interconnections, and offers poor efficiency.
- Safety guarantees are only available for specific constructs, e.g., loops in numerical software or side-effect-free program fragments, and not for general parallel software.

In effect, and as is generally recognized, current approaches to parallel programming are inadequate for leveraging multicore. We propose to remove these limitations and unleash the performance potential of multicore and manycore for European software-intensive industry through a systematic exploration and development of theory, methods, technology, and tools in support of a coherent framework: safe task-based parallel programming. This framework addresses all three fundamental problems of parallel programming, for new and legacy software, offering the hope and the promise of scalable performance, high software quality, and high developer productivity, and of becoming a future industry standard.

We will ensure the direct relevance of our work to our industry partners and offer them a head start in the paradigm shift to multicore and manycore through a project methodology centered on application patterns, which capture critical parallel programming challenges of our industry partners' systems products, in the application domains of industrial automation, telecommunications, aerospace, and crisis management. Application patterns will be at the center of project demonstrators, systems research, and disciplinary research.

Description of Project

Manycore Technology 2018

According to the International Technology Roadmap for Semiconductors (ITRS), the transition to multicore and manycore microprocessors will happen for portable and stationary embedded computer systems (systems-on-chip, SoC) as well as for more traditional microprocessor-based systems, affecting the entire European software-intensive systems industry. The number of cores is expected to increase by 40% every year. Given that the state-of-the-art for commercial systems in 2007 was 4-64 cores, this translates to hundreds of cores by 2013, and thousands of cores by 2018. Both ends of this scale are today considered to be massively parallel. Given these trends, we formulate a strategic technology vision which sets the scene for our research agenda. By 2018:

- Cores will be many. We are considering hundreds to thousands of cores in our research.
- There will be non-uniformity. Cores will see non-uniform memory access times and exhibit non-uniform performance. The non-uniformity comes from process variations, memory architecture, and the need for adaptability for fine-grained power control.
- Locality is a major issue. Since on-chip storage will be distributed, it is important that computations take place close to the data. Also, communication needs to be localized.
- Off-chip bandwidth grows slower than aggregate core performance. This means that we need to make use of on-chip storage as much as possible. Although Intel, IBM and others pursue 3D packaging as a way to overcome the bandwidth limitation of going off-chip, this just postpones the inevitable.
- Cores are connected with point-to-point interconnection networks. Already today, only the smaller multicore processors use broadcast interconnects such as buses. Larger chips use packet-switched interconnection networks. Communication between nearby cores will be cheaper than over longer distances.
- Power and thermal management are important. Manycore chips will be embedded in all kinds of products. To keep power and energy budgets, systems are needed to control power consumption and temperature at run-time.

This is only a small list of expected characteristics of manycore processors, and also a relatively conservative prediction as supported by the ITRS roadmap. In particular, we have not assumed the availability of hardware support for transactional memory or thread-level speculation, although our approach can take advantage of them if and when they become available.

Safe Task-Based Parallel Programming

We will develop theory, methods, technology, and tools for a parallel programming model based on tasks. The task-based model is well presented, e.g., in the work by Leiserson et al. on Cilk. We summarize its essential features. A task is a fine-grained unit of work that may be performed in parallel with other tasks. A program initially forms a single task. Tasks can create child tasks, which can execute in parallel with their parents. Tasks can wait for the termination of their descendant tasks. Tasks need not be performed in parallel: any task-based program has a canonical sequential execution, where the tasks are performed in order, as in a sequential program. It is the responsibility of a scheduler to map the, in general, very numerous tasks to the fewer available processor cores. This decoupling of tasks and cores at the software-hardware interface gives the task model machine independence and allows software to adapt to a varying number of available cores.
This allows parallel code to be developed by several teams in a distributed fashion and to be integrated with external parallel code; task-based parallelism is compositional. A task-based program is safe if parallel execution gives the same functional semantics as sequential execution. Thus, if each task is deterministic, any parallel execution will have exactly the same semantics as the unique sequential execution. Safety can be expressed as constraints, typically related to data dependencies, on task creation. Tasks fit very well with legacy code, since they typically follow program structure. Parallelization is therefore in general a local activity (in this procedure, these two statements, or this loop, can be executed in parallel), in contrast to the introduction of explicit threads, which tends to disrupt the program structure at a large scale. The task model described above is not tied to any particular programming language, nor is it unique to this proposal. For instance, Cilk and X10 are based on it. OpenMP is based on a mixture of tasks and threads and is evolving towards tasks with version 3.0.
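
To make the model concrete, a minimal sketch in C using OpenMP 3.0 task syntax; fib is the classic example from the Cilk literature, and this is illustrative only, not a committed project artifact:

    #include <stdio.h>

    /* Each recursive call spawns a child task that may run in parallel
     * with its parent; taskwait waits for the termination of descendant
     * tasks. Eliding the pragmas yields exactly the canonical sequential
     * execution, with identical semantics. Compile with: gcc -fopenmp */
    long fib(int n)
    {
        long x, y;
        if (n < 2) return n;
        #pragma omp task shared(x)
        x = fib(n - 1);
        #pragma omp task shared(y)
        y = fib(n - 2);
        #pragma omp taskwait
        return x + y;
    }

    int main(void)
    {
        long r;
        #pragma omp parallel
        #pragma omp single      /* the program initially forms a single task */
        r = fib(30);
        printf("fib(30) = %ld\n", r);
        return 0;
    }

A scheduler then maps the very numerous fib tasks onto however many cores happen to be available, which is exactly the machine independence argued for above.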

In particular, an OpenMP parallel for loop is essentially a way to express a multi-way task creation. There are many possible concrete programming models for task-based parallel programming, and part of the activity in this project is to find suitable constructs for expressing the parallelism in the application patterns.

Approach

We will address the challenges of parallel programming leveraging the safe task model as follows:

- Parallelism: We will study how to express application patterns that represent the systems of our industrial partners in the task model to maximize available parallelism. We will develop tools that reveal and measure potential task parallelism in legacy code.
- Scheduling: We will study locality- and energy-aware static and dynamic task scheduling algorithms as well as performance modeling tools, extending the existing models with the effects of locality. We leverage the machine independence of the task model as an enabler for scalable parallelism.
- Safety: We will study safety for task-based programs, develop dynamic tools to test the task safeness condition, and explore static tools that offer safety guarantees.

Since the task model is language independent, we plan to work with extensions to widely used existing languages, starting from the extensions of C and C++ defined by Cilk, OpenMP and Intel TBB. This will be the context of the tools developed in the project, to the extent that the tools are at all language dependent.

Project Structure

The proposed research follows an iterative process revolving around work packages addressing the demonstrator, systems-oriented research and disciplinary research. The demonstrator will strongly tie the different aspects of the project together in coherent systems of methods, technologies, and tools, with the end goal of providing industrial-strength support for the safe task-based parallel programming methodology and tools for manycore processors with several hundreds of processors. The demonstrator is also used in the parallelization effort for application patterns representing systems products of our industry partners. Given the nature of the topic, all of the research in this proposal is systems oriented to a higher or lower degree. All partners have strong systems-oriented experience, as shown in the research group section below. One purely systems-oriented work package is planned; the remaining work packages are predominantly disciplinary. The work packages are initially:

1. Application patterns (systems-oriented research)
2. Demonstrator (demonstrator)
3. Performance modeling (disciplinary research)
4. Constraint-aware distributed task scheduling (disciplinary research)
5. Dynamic analysis (disciplinary research)
6. Type and effect systems for safety (disciplinary research)

In the work package descriptions below, we indicate the partner leading each work package, the expected outcomes and annual milestones. Given the nature of this type of strategic research, these milestones need to be revised every year, for which we will seek advice from an international advisory board. Although all work packages of the project are interrelated, none is dependent on any other in order to start, except for the demonstrator. Work in WP1 and WP3-6 will therefore start immediately, and over the project lifetime knowledge from all work packages will gradually be integrated in the form of the demonstrator, which is built and released twice per year.
Experimental Methodology

In order to evaluate our approach and results we need to test them on real and simulated parallel computers. We will use the simulator Simics, from the Swedish SICS spinoff Virtutech, augmented with architectural models such as GEMS/Ruby from the Wisconsin Multifacet project. As argued in WP3 (Performance Modeling Tools), simulators have their limits, and we intend to follow the technology pace and each year acquire access to the current state-of-the-art parallel systems to demonstrate the effectiveness of our results. Initially we will use smaller multicore platforms and a moderately large shared-memory parallel computer at Uppsala University with 48 cores. Although this is not a system built out of multicore processors, any good performance results will in general be on the safe side, as communication costs will be much higher in this system relative to processing speed than in any multicore processor.

In order to demonstrate our approach on future non-uniform cache/memory/communication systems, other platforms have to be used. Given the length of the project, such systems are bound to appear sooner rather than later, and will be used as experimentation and demonstration vehicles. The experimental platform is not directly part of the demonstrator, but is used by it and by most other work packages.

Work Package 1: Application Patterns

The traditional methodology to test future computer systems has been to measure the systems' performance (power, reliability, etc.) under the load of programs from established benchmark suites. This methodology has the inherent drawback that it predicts the performance of future technology with yesterday's workloads. As an alternative, researchers at UC Berkeley have proposed the use of application patterns from important problem domains to represent the workloads on future systems. At Berkeley, a pattern is referred to as a dwarf and is defined as an algorithmic method that captures a pattern of computation and communication. An application pattern can in its most simple form consist of a single piece of code implementing the core of an important functionality of an industrial partner's software. In the general case, an application pattern can be a relatively complex software system. The Berkeley application patterns are mostly from the numerical domain, and are not in general typical of the needs of the European software-intensive systems industry, which is predominantly in the embedded domain. Therefore, a significant effort in this project will be devoted to the identification and formulation of patterns representative of the industrial systems of our partners, and to the exploration of task-parallel versions of these patterns. Application patterns allow us to capture the essential parallelization challenges while being free of the IPR and secrecy problems associated with production code.

This work package consists of two sub-tasks: (1) the identification, characterization, documentation, and iterative refinement of domain-specific application patterns, and (2) iteratively refined task-based parallelizations of the identified application patterns using the demonstrator from WP2. Both tasks involve close collaboration between the academic and industrial partners. We aim at performing parallelization using the demonstrator directly at the sites of our industrial partners, to evaluate both the programmer productivity and performance scalability aspects of the technology. An important aspect of this activity is that the knowledge transfer is bidirectional, since we expect both the programming model and the supporting tools to evolve in response to the experience gained in their use. The identification of patterns in different domains will directly benefit all partners, as insight into different patterns will increase the awareness of how applications can be parallelized.

Expected outcomes: A set of application patterns representing important software algorithms and designs for the industrial partners. Published reports describing and characterizing these application patterns from a parallelization perspective.
Work Package 2: Demonstrator

The project will demonstrate the development of safe and efficient parallel software for future multicore platforms based on project results, using application patterns representing high-value, high-impact systems of the project partners. The project demonstrator is a coherent system of methods, technologies, and tools emanating from the disciplinary work packages, together with coding guidelines evolving with the demonstrator, and is applied to application patterns from the systems work package:

- Application patterns
- Performance modeling tools
- Constraint-aware and distributed task scheduling algorithms
- Dynamic analysis tools
- Tools for safeness checking and safe parallelization
- Coding guidelines

The application of the demonstrator can be started at any phase of the development of a task-based parallel program. Assuming that we are given a sequential pattern, we can use dynamic analysis to identify potential sources of parallelism to be annotated with tasks according to the coding guidelines, then perform safeness checking of the resulting task-parallel pattern, analyze and predict performance on different numbers of cores, and finally verify the results using an execution platform based on scalable task scheduling.

The analysis results, together with the coding guidelines, will steer the programmer towards more easily maintained code with higher performance. Debugging, although an important task, is not part of this project proposal. As we aim for safe parallelism, debugging the parallel program is fundamentally equivalent to debugging the underlying sequential program. The safe parallel programming model guarantees that no new bugs are introduced in the parallelization effort. (For debugging to scale, the debugging execution should be parallel, but with the same external behavior as sequential execution.) The demonstrator includes the development of a set of coding guidelines which, in conjunction with the tools and task schedulers, will lead to easy development of safely parallel programs. The demonstrator will be developed in an iterative process with two annual full integrations, providing more functionality and coverage with each generation.

Expected outcomes: A coherent set of software analyzers, task schedulers, run-time systems and coding guidelines. Status reports and manuals describing the demonstrator on a yearly basis.

Work Package 3: Performance Modeling Tools

The current state-of-the-art for assessing performance and other characteristics of future multicore systems and of general multiprocessors is to use simulation. A model of the system under study is designed, typically as a computer program, and a workload is run on the simulated system. In the most detailed simulation models, processor internals as well as the memory hierarchy and interconnect design are modeled in great detail, making the simulator exceptionally slow. Although it is accepted to relax on the details when it comes to the processor internals, it has been shown that small changes in the model of the memory system may have significant effects on the system behavior, resulting in the need for multiple simulations with small random variations in, e.g., main memory latency, increasing the simulation time. On the other hand, task-based parallelism has a simple performance model relating the expected execution time T(n) on n processors to the sequential time T(1) and the unboundedly parallel execution time T(∞): T(n) < T(∞) + T(1)/n (a small worked example follows at the end of this work package). This model does not, however, take locality or memory hierarchy issues into account. Future-proof performance estimations must not rely on detailed simulation, while still being able to take important architectural characteristics such as memory hierarchy, number of cores and locality into account. We therefore need modeling techniques that capture the inherent behavior of programs and which can extrapolate the information to the use of more cores. We propose to extend the current techniques and models to support the requested extrapolation capability. Besides statistical sampling of memory references from a sample execution, these models are expected to make use of data collected by the run-time scheduler (WP4), the dynamic analyzer (WP5) and the type-and-effect system (WP6).

Expected outcomes: Prototype tools extending StatCache-MP with predictive capabilities and low-overhead sampling. Published patents, reports and dissertations describing this work.
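
As the worked example referenced above (illustrative numbers, not measurements): with T(1) = 100 s of total work and a critical path of T(∞) = 1 s, the model predicts T(64) < 1 + 100/64 ≈ 2.6 s. A minimal C sketch tabulating the bound:

    #include <stdio.h>

    /* Work-span bound for task-parallel execution: T(n) < T(inf) + T(1)/n,
     * where T(1) is the total work and T(inf) the critical-path length. */
    static double bound(double t1, double t_inf, int n)
    {
        return t_inf + t1 / n;
    }

    int main(void)
    {
        /* Illustrative numbers only: 100 s of work, 1 s critical path. */
        const double t1 = 100.0, t_inf = 1.0;
        for (int n = 1; n <= 1024; n *= 4)
            printf("T(%4d) < %6.2f s  (speedup at least %5.1fx)\n",
                   n, bound(t1, t_inf, n), t1 / bound(t1, t_inf, n));
        return 0;
    }

Note that the predicted speedup saturates at T(1)/T(∞) (here 100x): the available task parallelism must comfortably exceed the core count for a program to scale, which is one reason fine-grained tasks matter at manycore scale.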
Work Package 4: Constraint-Aware and Distributed Task Scheduling

Almost all available work on task scheduling for multicore processors refers to static task scheduling, where tasks are mapped to processors (cores) at compile time or when the tasks are created. This is neither flexible enough nor capable of meeting the challenges we are facing. Examples of dynamic task scheduling are Cilk, Intel Threading Building Blocks and parts of OpenMP version 3.0. In these models, threads are used as workers that execute tasks from a pool of tasks. Typically one thread is started per core. In Cilk and TBB, the scheduling of tasks onto worker threads is distributed by means of a task-stealing algorithm, where a worker thread steals tasks from another thread's task pool at specific thread synchronization points. In OpenMP, the algorithm is implementation dependent, and more control over scheduling is available to the programmer if desired. The schedulers currently available for these systems do not scale well to future manycore systems. The Intel TBB scheduler is inherently coarse grained, as it is entirely invoked as a run-time library. In contrast, the Cilk and OpenMP schedulers can be implemented partly as a library routine and partly by controlling the output from the compiler. Still, the Cilk scheduler, which is a randomized task-stealing algorithm, is totally oblivious of the locality and non-uniformity aspects we will experience in the future.
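
To fix ideas, a deliberately simplified, single-threaded simulation of randomized task stealing with the kind of locality-biased victim selection this work package targets. All names, the 4x4 mesh, and the bias policy are hypothetical illustrations; a real scheduler uses concurrent deques with one worker thread per core:

    #include <stdio.h>
    #include <stdlib.h>

    #define NWORKERS 16          /* modeled as cores on a 4x4 on-chip mesh */
    #define NTASKS   512
    #define QCAP     1024
    #define STEPS    4096

    static int deque[NWORKERS][QCAP];
    static int top[NWORKERS], bottom[NWORKERS]; /* steal at top, pop at bottom */

    static int hop_distance(int a, int b)       /* Manhattan distance on mesh  */
    {
        return abs(a / 4 - b / 4) + abs(a % 4 - b % 4);
    }

    static int pick_victim(int self)
    {
        /* Locality bias: retry a few times while the candidate is self or
         * far away, then fall back to plain uniform stealing as in Cilk. */
        int victim = rand() % NWORKERS;
        for (int t = 0; t < 3 && (victim == self || hop_distance(self, victim) > 2); t++)
            victim = rand() % NWORKERS;
        return victim;
    }

    int main(void)
    {
        int executed[NWORKERS] = {0};

        for (int i = 0; i < NTASKS; i++)        /* all work starts on worker 0 */
            deque[0][bottom[0]++] = i;

        for (int step = 0; step < STEPS; step++) {
            int w = step % NWORKERS;            /* round-robin stand-in for cores */
            if (top[w] < bottom[w]) {           /* local work available: LIFO pop */
                --bottom[w];
                executed[w]++;
            } else {                            /* deque empty: try to steal      */
                int v = pick_victim(w);
                if (v != w && top[v] < bottom[v])
                    deque[w][bottom[w]++] = deque[v][top[v]++];  /* FIFO steal */
            }
        }
        for (int w = 0; w < NWORKERS; w++)
            printf("worker %2d executed %3d tasks\n", w, executed[w]);
        return 0;
    }

The bias keeps stolen tasks close to the data they touch; the fallback to uniform random stealing preserves the load-balancing behavior of the classic algorithm.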

We will in this work package extend the current state-of-the-art by introducing a scalable, constraint-aware task scheduler implemented in existing prototype systems for task-based parallelism. Some of the constraints and requirements on the task scheduler that we need to consider are:

- Tasks should be scheduled on cores where their data are located.
- Tasks should be scheduled on cores close to the cores where the tasks they need to communicate with are located.
- The scheduling algorithm must be distributed in order to scale with the increase in cores.
- The scheduler should interact with power and thermal management systems, which typically adapt the number of available cores to match power and temperature constraints.

One main goal of this work package is to define a task model that permits as fine-grained parallelism as possible. A scheduler entirely implemented in a run-time library may be used to schedule tasks with a granularity on the order of thousands of instructions. A compiler-supported scheduler may improve this to tasks with a granularity on the order of hundreds of instructions. Finally, we may envisage hardware-supported schedulers that can support a granularity as small as on the order of tens of instructions. Also, the scheduling model should be both flexible and predictive, in order to achieve the best parallelism and be possible to model.

This work will be performed in three steps. First, we will explore dynamic scheduling algorithms, taking application fingerprints as input and dynamically rescheduling tasks to cores of a non-uniform architecture while minimizing the performance impact of task interaction, shared cache usage and memory interface pressure. Fingerprints will be collected during short time batches, and new scheduling decisions will be made in a semi-dynamic fashion to minimize the overhead. Several scheduling alternatives will be evaluated, such as fair-share, maximum throughput, and minimum off-chip bandwidth. This scheme will work well for scheduling of coarse-grained tasks, but will not be a good fit for activities with a short execution time. One sub-goal is to develop algorithms capable of predicting such sharing between independent applications, for which a fingerprint has been captured in isolation.

Oracle scheduling will be our second step towards a scheduler supporting more short-running activities. We will measure dynamic information about the activity at runtime and determine what would have been the best scheduling decision at dispatch time, had we known this information then. Based on this oracle information, scheduling alternatives similar to those mentioned above will be evaluated.

Our third step is historic scheduling. The historic scheduling activity aims at developing heuristics and methods for predicting the use of resources shared by several concurrent activities based on past history. We will develop heuristics that allow us to predict good scheduling at dispatch time based on past history. For example: the last time this user started a job with these parameters, it resulted in a specific performance fingerprint; the scheduling decision at dispatch time is based on the assumption of a similar fingerprint this time. In a similar way, tasks created in a certain manner will be assumed to have a performance fingerprint similar to that of comparable activities in the past.

Expected outcomes: Prototype constraint-aware and distributed schedulers incorporated in existing compilers and runtime systems, such as the ones for OpenMP (gcc version 4.4), Cilk-5 or Intel TBB.
Published reports and dissertations. Patents, where applicable.

Work Package 5: Dynamic Analysis

Dynamic off-line dependence analysis (DDA) is a powerful approach to finding useful potential parallelism in sequential parts of legacy code, as well as to checking the safeness of task-parallel code. It is based on observing the dependencies that occur in a running program. This can be accomplished by instrumenting the program itself or by running it under an instrumented emulator (our current prototype uses Valgrind). The dependencies are collected in dedicated analysis runs, so the overhead does not affect the performance of production runs. Since DDA is based on execution, it is highly language independent and applicable to programs written in a combination of languages, and where source code is only available for parts of the program. It is also exact, in contrast to static analysis, which must always be conservative, so it can be used to find places where static analysis will over-approximate the dependencies. DDA is applied either to sequential parts of legacy code, with the aim of uncovering potential parallelism, or to the sequential reading of task-parallel code, to ensure safeness. Since DDA observes particular executions, its results are not guaranteed to hold for all possible executions. However, preliminary results indicate that selecting inputs that cause the entire program to be executed typically reveals all dependencies.
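
As a hedged illustration of the underlying mechanism (not of the actual Valgrind-based prototype): a shadow table recording the last writer of each address suffices to detect cross-region flow dependencies. All names below are hypothetical, and a full tool would also track readers to catch anti- and output dependencies and handle hash collisions:

    #include <stdio.h>

    /* Toy dynamic dependence analysis. A shadow table maps each monitored
     * address to the region (e.g., a loop iteration or candidate task)
     * that last wrote it; instrumentation calls on_write/on_read. */

    #define SHADOW_SIZE 4096
    static int last_writer[SHADOW_SIZE];     /* 0 = no writer seen yet */

    static unsigned slot(void *addr)
    {
        return (unsigned)(((unsigned long)addr >> 2) % SHADOW_SIZE);
    }

    static void on_write(void *addr, int region)
    {
        last_writer[slot(addr)] = region;
    }

    static void on_read(void *addr, int region)
    {
        int w = last_writer[slot(addr)];
        if (w != 0 && w != region)           /* cross-region flow dependence */
            printf("dependence: region %d reads data written by region %d\n",
                   region, w);
    }

    int main(void)
    {
        /* Simulated execution of two candidate tasks: region 2 consumes a
         * value produced by region 1, so the two must not run in parallel. */
        int a = 0, b = 0;
        on_write(&a, 1); a = 42;             /* region 1 produces a   */
        on_write(&b, 2); b = 7;              /* region 2, local write */
        on_read(&a, 2);  b += a;             /* flagged: RAW, 1 -> 2  */
        on_read(&b, 2);                      /* local read, silent    */
        printf("b = %d\n", b);
        return 0;
    }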

If the analyzed program is deterministic, dependencies can only be missed due to an insufficient set of test inputs, and the problem can immediately be identified by rerunning the analysis with the offending input. Thus, with DDA, all debugging takes place in a sequential setting. In this work package we will investigate DDA, producing a complete tool applicable to industrial software. Issues to tackle are: DDA for multithreaded code (with the objective of parallelizing individual threads); constructing models for predicting the confidence with which one can generalize the absence of dependencies for the test inputs to all inputs; combining the instrumented approach (which collects all dependencies) with hardware-based statistical sampling techniques with several orders of magnitude lower overhead; and the extension to a parallelization support tool giving advice on the most profitable parallelization opportunities.

Expected outcomes: Tools for parallelization support and safeness checking based on dynamic dependence analysis. Published reports and dissertations.

Work Package 6: Type and Effect Systems for Safety

We will here develop theory, methods, and tools for verifying the safety of a task-based parallel program. A safe task-based parallel program is guaranteed to have the same semantics as the underlying sequential program. This will eliminate parallel programming as an additional source of race-condition software defects, with great impact on quality and productivity. Our approach is to extend the state-of-the-art in type and effect systems to express and derive constraints on the allowed data dependencies between tasks. This is analogous to data type systems, which constrain the possible values of variables. Where data types are associated with variables, effect types capture the side effects that may arise from the execution of a statement. Effects are reads and writes to regions, which represent a static division of the address space into disjoint subsets. Thus, if a task writes to a region r, then no parallel task may either read or write the same region r.

We believe that formulating static dependence analysis as a type system has several advantages. Type systems have strong compositional properties; the information about a program fragment is represented in its (annotated) type, and once that type has been determined, the fragment itself need not be analyzed further. This property enables seamless support for inter-procedural analysis, even in the presence of first-class procedures, as well as modular analysis. In contrast, non-type-based static analysis often requires the availability of the entire source code. Programs need not contain explicit type information, since a type inference algorithm can reconstruct the types that a program must have in order to be correct. Static analysis strikes a balance between precision and performance. Traditionally, in automatic parallelization, the designer of the analyzer makes that tradeoff. One of the attractive features of type inference in the context of safe parallelism is that a parallel programmer can use type annotations to guide the analyzer. Type annotations can be arbitrarily precise, while still being checkable by the analyzer. Thus, the safety of parallel programs that would never have been generated by an automatic parallelizer can still be verified.
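
The rule above (a writer to region r excludes any parallel reader or writer of r) can be evaluated on explicit effect sets, as the hedged C sketch below does at run time; an actual effect system would establish the same property statically, at the type level. All names are hypothetical:

    #include <stdio.h>

    /* Effects of a task, as bit sets over a small universe of regions.
     * Bit i set in reads/writes means the task may read/write region i. */
    typedef struct {
        const char *name;
        unsigned reads;
        unsigned writes;
    } effect_t;

    /* Two tasks may run in parallel iff neither writes a region the other
     * reads or writes: check write/write, write/read and read/write. */
    static int may_run_in_parallel(effect_t a, effect_t b)
    {
        return (a.writes & (b.reads | b.writes)) == 0 &&
               (b.writes & (a.reads | a.writes)) == 0;
    }

    int main(void)
    {
        enum { R_INPUT = 1 << 0, R_LEFT = 1 << 1, R_RIGHT = 1 << 2 };

        /* Two child tasks read a shared input region but write disjoint
         * output regions, so they are safe to run in parallel. */
        effect_t left   = { "left",   R_INPUT, R_LEFT  };
        effect_t right  = { "right",  R_INPUT, R_RIGHT };
        /* A third task writes the input region, conflicting with both. */
        effect_t update = { "update", 0,       R_INPUT };

        printf("left || right : %s\n",
               may_run_in_parallel(left, right)  ? "safe" : "conflict");
        printf("left || update: %s\n",
               may_run_in_parallel(left, update) ? "safe" : "conflict");
        return 0;
    }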
We will in this work formalize and investigate the theoretical properties of a series of type systems allowing progressively more powerful parallel patterns to be type checked, an activity that will advance the state-of-the-art in polymorphism, effect types and dependent types. We will also implement type inference tools that will automatically check program safety. The theoretical results will be applicable to existing explicitly parallel programming environments such as X10, OpenMP or Cilk, and the tools will work with at least one of these languages.

Expected outcomes: Prototype tools for parallelization support and safeness checking based on effect typing. Published reports and dissertations.

Contribution to competitiveness

The multicore revolution is forcing a paradigm shift onto the software industry. No longer will increasing clock frequency in new generations of microprocessor hardware drive a commensurate increase in the performance of sequential programs. To leverage the performance increases of future hardware generations, all software must be parallel. To be future-proof, the parallel performance must scale with the number of cores. To be competitive, the added requirement of parallelism must not hamper engineering efficiency.

This affects the systems industry dramatically, since it is highly software-intensive. For instance, Ericsson's CEO recently described the company as the world's fifth largest software company. About 80% of the development costs can be attributed to software. Software represents over 50% of development costs in several other companies. At the same time, only 1% of developers are familiar with parallel programming, and the development cost of parallel software is 2-3 times higher than for normal software. Mastering this grand challenge will require a paradigm shift for software development, in Europe as in the rest of the world. A race has begun for new methods, technologies, and tools. This situation has been recognized all over the world and has motivated several research initiatives. In particular, the Computing Systems call in the EU Seventh Framework Programme addresses this issue, as does the ARTEMIS technology platform. In the US, the computer industry has formed strategic partnerships with academic institutions, with the expectation of obtaining a competitive advantage. Microsoft and Intel are spending $20M over five years at the University of California at Berkeley and the University of Illinois at Urbana-Champaign. AMD, HP, Intel, NVIDIA, and Sun are spending $6M over three years at Stanford University.

There appears to be no simple technical fix for the problem. While techniques such as automatic parallelization, transactional memory and thread-level speculation are attracting interest, we believe that they are short-term solutions that do not scale to hundreds or thousands of cores. This level of parallelism must be designed into the code from the start, which necessitates the evolution of a programming model that forms a new division of responsibility between programmers and machines. Task parallelism is emerging as one of the main contenders for the future of parallel programming and is at the heart of several important technologies, such as OpenMP 3.0, Cilk, Intel TBB, and as of recently also the .NET platform. We believe that our proposal adds the following unique features:

1. Dynamic dependence analysis is a language-independent method for discovering potential and profitable parallelism in new and legacy code.
2. The use of type and effect systems for the vital role of ensuring safeness has the attractive property that types in most cases can be derived automatically, without any type annotations in the program, while still allowing type annotations as a way to guide the safeness checker in especially tricky situations.
3. User-level schedulers adapting to run-time locality information have the potential to efficiently exploit inter-task locality, which is crucial since tasks are too fine-grained to have enough internal locality.
4. Task-based modeling of memory system performance will improve the predictive power of the current models, which only take the number of operations and the length of the critical path into account.

In a recent list of priorities of parallel programmers collected by Petersen, Robinson, Leasure and Mattson, the first three items were:

1. Finding concurrent tasks in a program, both for legacy code and code written from scratch.
2. Scheduling tasks at the right granularity onto the processors of a parallel machine.
3. The data locality problem: associating data with tasks.

This proposal directly addresses these top-priority tasks.
Potential Consortium

A proposal for a consortium could include the following identified partners:

- KTH (Royal Institute of Technology): manycore architectures, operating system scheduling algorithms for hardware resources as well as threads, parallel programming models
- SICS (Swedish Institute of Computer Science): tools for data dependence analysis, run-time systems for task-based parallelism, type and effect systems
- Ericsson AB: application provider and software developer
- Your expertise here?


More information

So#ware Tools and Techniques for HPC, Clouds, and Server- Class SoCs Ron Brightwell

So#ware Tools and Techniques for HPC, Clouds, and Server- Class SoCs Ron Brightwell So#ware Tools and Techniques for HPC, Clouds, and Server- Class SoCs Ron Brightwell R&D Manager, Scalable System So#ware Department Sandia National Laboratories is a multi-program laboratory managed and

More information

Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

White Paper Abstract Disclaimer

White Paper Abstract Disclaimer White Paper Synopsis of the Data Streaming Logical Specification (Phase I) Based on: RapidIO Specification Part X: Data Streaming Logical Specification Rev. 1.2, 08/2004 Abstract The Data Streaming specification

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

Software Development Under Stringent Hardware Constraints: Do Agile Methods Have a Chance?

Software Development Under Stringent Hardware Constraints: Do Agile Methods Have a Chance? Software Development Under Stringent Hardware Constraints: Do Agile Methods Have a Chance? Jussi Ronkainen, Pekka Abrahamsson VTT Technical Research Centre of Finland P.O. Box 1100 FIN-90570 Oulu, Finland

More information

Building Scalable Applications Using Microsoft Technologies

Building Scalable Applications Using Microsoft Technologies Building Scalable Applications Using Microsoft Technologies Padma Krishnan Senior Manager Introduction CIOs lay great emphasis on application scalability and performance and rightly so. As business grows,

More information

How To Create A Concurrent Cloud Computing System

How To Create A Concurrent Cloud Computing System THROUGHPUTER PaaS for creating and executing concurrent cloud applications OVERVIEW 1) Fundamental transformation in computing: Concurrent apps on dynamically shared resources Micro-services: unpredictable

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

Automated Software Testing of Memory Performance in Embedded GPUs. Sudipta Chattopadhyay, Petru Eles and Zebo Peng! Linköping University

Automated Software Testing of Memory Performance in Embedded GPUs. Sudipta Chattopadhyay, Petru Eles and Zebo Peng! Linköping University Automated Software Testing of Memory Performance in Embedded GPUs Sudipta Chattopadhyay, Petru Eles and Zebo Peng! Linköping University 1 State-of-the-art in Detecting Performance Loss Input Program profiling

More information

Scalability evaluation of barrier algorithms for OpenMP

Scalability evaluation of barrier algorithms for OpenMP Scalability evaluation of barrier algorithms for OpenMP Ramachandra Nanjegowda, Oscar Hernandez, Barbara Chapman and Haoqiang H. Jin High Performance Computing and Tools Group (HPCTools) Computer Science

More information

Studying Code Development for High Performance Computing: The HPCS Program

Studying Code Development for High Performance Computing: The HPCS Program Studying Code Development for High Performance Computing: The HPCS Program Jeff Carver 1, Sima Asgari 1, Victor Basili 1,2, Lorin Hochstein 1, Jeffrey K. Hollingsworth 1, Forrest Shull 2, Marv Zelkowitz

More information

MAQAO Performance Analysis and Optimization Tool

MAQAO Performance Analysis and Optimization Tool MAQAO Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL andres.charif@uvsq.fr Performance Evaluation Team, University of Versailles S-Q-Y http://www.maqao.org VI-HPS 18 th Grenoble 18/22

More information

The Truth Behind IBM AIX LPAR Performance

The Truth Behind IBM AIX LPAR Performance The Truth Behind IBM AIX LPAR Performance Yann Guernion, VP Technology EMEA HEADQUARTERS AMERICAS HEADQUARTERS Tour Franklin 92042 Paris La Défense Cedex France +33 [0] 1 47 73 12 12 info@orsyp.com www.orsyp.com

More information

Optimizing Configuration and Application Mapping for MPSoC Architectures

Optimizing Configuration and Application Mapping for MPSoC Architectures Optimizing Configuration and Application Mapping for MPSoC Architectures École Polytechnique de Montréal, Canada Email : Sebastien.Le-Beux@polymtl.ca 1 Multi-Processor Systems on Chip (MPSoC) Design Trends

More information

A Review of Customized Dynamic Load Balancing for a Network of Workstations

A Review of Customized Dynamic Load Balancing for a Network of Workstations A Review of Customized Dynamic Load Balancing for a Network of Workstations Taken from work done by: Mohammed Javeed Zaki, Wei Li, Srinivasan Parthasarathy Computer Science Department, University of Rochester

More information

Multi-core Curriculum Development at Georgia Tech: Experience and Future Steps

Multi-core Curriculum Development at Georgia Tech: Experience and Future Steps Multi-core Curriculum Development at Georgia Tech: Experience and Future Steps Ada Gavrilovska, Hsien-Hsin-Lee, Karsten Schwan, Sudha Yalamanchili, Matt Wolf CERCS Georgia Institute of Technology Background

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION CHAPTER 1 INTRODUCTION 1.1 Background The command over cloud computing infrastructure is increasing with the growing demands of IT infrastructure during the changed business scenario of the 21 st Century.

More information

Parallelism and Cloud Computing

Parallelism and Cloud Computing Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication

More information

Oracle Solaris Studio Code Analyzer

Oracle Solaris Studio Code Analyzer Oracle Solaris Studio Code Analyzer The Oracle Solaris Studio Code Analyzer ensures application reliability and security by detecting application vulnerabilities, including memory leaks and memory access

More information

Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus

Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus A simple C/C++ language extension construct for data parallel operations Robert Geva robert.geva@intel.com Introduction Intel

More information

Design Considerations for Network Processor Operating Systems

Design Considerations for Network Processor Operating Systems Design Considerations for Network rocessor Operating Systems Tilman Wolf, Ning Weng, and Chia-Hui Tai Dept. of Electrical and Computer Engineering University of Massachusetts Amherst, MA, USA {wolf,nweng,ctai}@ecs.umass.edu

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

Software Distributed Shared Memory Scalability and New Applications

Software Distributed Shared Memory Scalability and New Applications Software Distributed Shared Memory Scalability and New Applications Mats Brorsson Department of Information Technology, Lund University P.O. Box 118, S-221 00 LUND, Sweden email: Mats.Brorsson@it.lth.se

More information

School of Computer Science

School of Computer Science School of Computer Science Computer Science - Honours Level - 2014/15 October 2014 General degree students wishing to enter 3000- level modules and non- graduating students wishing to enter 3000- level

More information

Intel Data Direct I/O Technology (Intel DDIO): A Primer >

Intel Data Direct I/O Technology (Intel DDIO): A Primer > Intel Data Direct I/O Technology (Intel DDIO): A Primer > Technical Brief February 2012 Revision 1.0 Legal Statements INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,

More information

Trends in High-Performance Computing for Power Grid Applications

Trends in High-Performance Computing for Power Grid Applications Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views

More information

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1 Monitors Monitor: A tool used to observe the activities on a system. Usage: A system programmer may use a monitor to improve software performance. Find frequently used segments of the software. A systems

More information

Scheduling and Resource Management in Computational Mini-Grids

Scheduling and Resource Management in Computational Mini-Grids Scheduling and Resource Management in Computational Mini-Grids July 1, 2002 Project Description The concept of grid computing is becoming a more and more important one in the high performance computing

More information