Scalability and Programmability in the Manycore Era




Date: 2009/01/20

Scalability and Programmability in the Manycore Era
A draft synopsis for an EU FP7 STREP proposal
Mats Brorsson, KTH Information and Communication Technology / Swedish Institute of Computer Science (SICS)
matsbror@kth.se, http://www.sics.se/multicore

Main Goals

We propose to address a grand challenge for the European software-intensive systems industry: to leverage multicore processors in the development of competitive products, and to take full advantage of the predicted technology evolution in a strategic perspective, from today's 4-64 cores, to 100s in five years, and 1000s in ten years (manycore). To leverage multicore and manycore, all software must be parallel. To be future-proof, the parallelism must be scalable. To be competitive, parallel programming must be performed with high software quality and productivity. The project proposal is written in the context of three years with explicit milestones for each year.

Parallel programming presents three fundamental problems to developers: (1) parallelism: subdividing computations into units of work that can be performed simultaneously by different processor cores, (2) scheduling: assigning units of work to specific cores, and (3) safety: ensuring that parallel units of work are properly coordinated and free from incorrect interactions. Current approaches to these problems are characterized by one or more of the following limitations:

- The computation is statically subdivided into few and coarse-grained units of work, e.g., threads, which do not scale to more cores.
- The subdivision works for certain classes of software, e.g., loop parallelism in scientific computations, and not for general software.
- The subdivision affects overall program structure, making it difficult and costly to apply to legacy software.
- The subdivision introduces a complex and nondeterministic control flow, exacerbating problems of debugging and performance management.
- Scheduling is performed at a coarse-grained level of work, e.g., OS processes and threads, and does not scale.
- Scheduling does not consider locality of memory and interconnections, and offers poor efficiency.
- Safety guarantees are only available for specific constructs, e.g., loops in numerical software or side-effect-free program fragments, and not for general parallel software.

In effect, and as is generally recognized, current approaches to parallel programming are inadequate for leveraging multicore. We propose to remove these limitations and unleash the performance potential of multicore and manycore for the European software-intensive industry through a systematic exploration and development of theory, methods, technology, and tools in support of a coherent framework: safe task-based parallel programming. This framework addresses all three fundamental problems of parallel programming, for new and legacy software, offering the hope and the promise of scalable performance, high software quality, and high developer productivity, and of becoming a future industry standard. We will ensure the direct relevance of our work to our industry partners and offer them a head start in the paradigm shift to multicore and manycore, through a project methodology centered on application patterns, which capture critical parallel programming challenges of industry partners' systems products, in the application domains of industrial automation, telecommunications, aerospace, and crisis management. Application patterns will be at the center of project demonstrators, systems research, and disciplinary research.

Description of Project

Manycore Technology 2018

According to the International Technology Roadmap for Semiconductors 2007 (http://www.itrs.net/links/2007itrs/home2007.htm), the transition to multicore and manycore microprocessors will happen for portable and stationary embedded computer systems (systems-on-chip, SoC) as well as for more traditional microprocessor-based systems, affecting the entire European software-intensive systems industry. The number of cores is expected to increase by 40% every year. Given that the state-of-the-art for commercial systems in 2007 was 4-64 cores, this translates to 30-480 cores by 2013, and 160-2560 cores by 2018. Both ends of this scale are today considered to be massively parallel. Given these trends, we formulate a strategic technology vision which sets the scene for our research agenda. By 2018:

- Cores will be many. We are considering 100-1000+ cores in our research.
- There will be non-uniformity. Cores will see non-uniform memory access times and exhibit non-uniform performance. The non-uniformity comes from process variations, memory architecture, and the need for adaptability for fine-grained power control.
- Locality is a major issue. Since on-chip storage will be distributed, it is important that computations take place close to the data. Also, communication needs to be localized.
- Off-chip bandwidth grows slower than aggregate core performance. This means that we need to make use of on-chip storage as much as possible. Although Intel, IBM and others pursue 3D packaging as a way to overcome the bandwidth limitation of going off-chip, this just postpones the inevitable.
- Cores are connected with point-to-point interconnection networks. Already today, only the smaller multicore processors use broadcast interconnects such as buses. Larger chips use packet-switched interconnection networks. Communication between nearby cores will be cheaper than over longer distances.
- Power and thermal management are important. Manycore chips will be embedded in all kinds of products. To keep power and energy budgets, systems are needed to control power consumption and temperature at run-time.

This is only a small list of expected characteristics of manycore processors and also a relatively conservative prediction, as supported by the ITRS roadmap. In particular, we have not assumed the availability of hardware support for transactional memory or thread-level speculation, although our approach can take advantage of them if and when they become available.

Safe Task-Based Parallel Programming

We will develop theory, methods, technology, and tools for a parallel programming model based on tasks. The task-based model is well presented, e.g., in the work by Leiserson et al. on Cilk. We summarize its essential features. A task is a fine-grained unit of work that may be performed in parallel with other tasks. A program initially forms a single task. Tasks can create child tasks, which can execute in parallel with their parents. Tasks can wait for the termination of their descendant tasks. Tasks need not be performed in parallel. Any task-based program has a canonical sequential execution, where the tasks are performed in order, as in a sequential program. It is the responsibility of a scheduler to map the, in general, very numerous tasks to the fewer available processor cores. This decoupling of tasks and cores at the software-hardware interface gives the task model machine independence and allows software to adapt to a varying number of available cores.
This allows parallel code to be developed by several teams in a distributed fashion and integrated with external parallel code: task-based parallelism is compositional. A task-based program is safe if parallel execution gives the same functional semantics as sequential execution. Thus, if each task is deterministic, any parallel execution will have exactly the same semantics as the unique sequential execution. Safety can be expressed as constraints, typically related to data dependencies, on task creation. Tasks fit very well with legacy code since they typically follow program structure. Thus parallelization is in general a local activity (in this procedure, these two statements, or this loop, can be executed in parallel), in contrast to the introduction of explicit threads, which tends to disrupt the program structure at a large scale. The task model described above is not tied to any particular programming language, nor is it unique to this proposal. For instance, Cilk and X10 are based on it. OpenMP is based on a mixture of tasks and threads and is evolving towards tasks with version 3.0. In particular, an OpenMP parallel for loop is essentially a way to express a multi-way task creation.
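To make these features concrete, the following minimal sketch (our own illustration, not project code) expresses a recursive computation with OpenMP 3.0 tasks, one of the C/C++ extensions mentioned above; it assumes an OpenMP 3.0 compiler such as gcc 4.4 invoked with -fopenmp.

    // Minimal illustration of the task model using OpenMP 3.0 tasks.
    #include <cstdio>

    // Each recursive call may become a task that runs in parallel with its
    // parent; "taskwait" waits for the descendant tasks. Because the two child
    // tasks write to disjoint variables (x and y), every parallel execution has
    // the same semantics as the canonical sequential execution.
    static long fib(int n) {
        if (n < 2) return n;
        long x, y;
        #pragma omp task shared(x)
        x = fib(n - 1);
        #pragma omp task shared(y)
        y = fib(n - 2);
        #pragma omp taskwait
        return x + y;
    }

    int main() {
        long r = 0;
        #pragma omp parallel    // start a team of worker threads
        #pragma omp single      // the program initially forms a single task
        r = fib(30);
        std::printf("fib(30) = %ld\n", r);
        return 0;
    }

An OpenMP parallel for loop over independent iterations plays the same role as creating one task per iteration, and a run-time scheduler is free to map the resulting tasks onto however many cores are available.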

There are many possible concrete programming models for task-based parallel programming, and part of the activity in this project is to find suitable constructs for expressing the parallelism in the application patterns.

Approach

We will address the challenges of parallel programming leveraging the safe task model as follows:

- Parallelism: We will study how to express application patterns that represent the systems of our industrial partners in the task model to maximize available parallelism. We will develop tools that reveal and measure potential task parallelism in legacy code.
- Scheduling: We will study locality- and energy-aware static and dynamic task scheduling algorithms as well as performance modeling tools, extending the existing models with the effects of locality. We leverage the machine independence of the task model as an enabler for scalable parallelism.
- Safety: We will study safety for task-based programs, develop dynamic tools to test the task safeness condition, and explore static tools that offer safety guarantees.

Since the task model is language independent, we plan to work with extensions to widely used existing languages, starting from the extensions of C and C++ defined by Cilk, OpenMP and Intel TBB. This will be the context of the tools developed in the project, to the extent that the tools are at all language dependent.

Project Structure

The proposed research follows an iterative process revolving around work packages addressing the demonstrator, systems-oriented research and disciplinary research. The demonstrator will strongly tie the different aspects of the project together in coherent systems of methods, technologies, and tools, with the end goal of providing industrial-strength support for the safe task-based parallel programming methodology and tools for manycore processors with several hundreds of processors. The demonstrator is also used in the parallelization effort for application patterns representing systems products of our industry partners. Given the nature of the topic, all of the research in this proposal is systems oriented to a greater or lesser degree. All partners have strong systems-oriented experience, as shown in the research group section below. One purely systems-oriented work package is planned; the remaining work packages are predominantly disciplinary. The work packages are initially:

1. Application patterns (systems-oriented research)
2. Demonstrator (demonstrator)
3. Performance modeling (disciplinary research)
4. Constraint-aware distributed task scheduling (disciplinary research)
5. Dynamic analysis (disciplinary research)
6. Type and effect systems for safety (disciplinary research)

In the work package descriptions below, we indicate the partner leading each work package, the expected outcomes and annual milestones. Given the strategic nature of this research, these milestones need to be revised every year, for which we will seek advice from an international advisory board. Although all work packages of the project are interrelated, none is dependent on any other in order to start, except for the demonstrator. Work in WP1 and WP3-6 will therefore start immediately and, gradually over the project lifetime, knowledge from all work packages will be integrated in the form of the demonstrator, which is built and released twice per year.
Experimental Methodology

In order to evaluate our approach and results, we need to test them on real and simulated parallel computers. We will use the simulator Simics, from the Swedish SICS spinoff Virtutech, augmented with architectural models such as GEMS/Ruby from the Wisconsin Multifacet project. As argued in WP3 (Performance Modeling Tools), simulators have their limits, and we intend to follow the pace of technology and each year acquire access to current state-of-the-art parallel systems to demonstrate the effectiveness of our results. Initially we will use smaller multicore platforms and a moderately large shared-memory parallel computer at Uppsala University with 48 cores. Although this is not a system built out of multicore processors, any good performance results will in general be on the safe side, as communication costs relative to processing speed will be much higher in this system than in any multicore processor.

In order to demonstrate our approach on future non-uniform cache/memory/communication systems, other platforms have to be used. Given the length of the project, such systems are bound to appear sooner rather than later and will be used as experimentation and demonstration vehicles. The experimental platform is not directly part of the demonstrator, but is used by it and by most other work packages.

Work Package 1: Application Patterns

The traditional methodology for testing future computer systems has been to measure the system's performance (power, reliability, etc.) under the load of programs from established benchmark suites. This methodology has the inherent drawback that it predicts the performance of future technology with yesterday's workloads. As an alternative, researchers at UC Berkeley have proposed the use of application patterns from important problem domains to represent the workloads on future systems. At Berkeley, a pattern is referred to as a dwarf and is defined as an algorithmic method that captures a pattern of computation and communication. An application pattern can in its simplest form consist of a single piece of code implementing the core of an important functionality of an industrial partner's software. In the general case, an application pattern can be a relatively complex software system (a minimal illustrative sketch of the simplest case is given at the end of this work package description). The Berkeley application patterns are mostly from the numerical domain and are not in general typical of the needs of the European software-intensive systems industry, which is predominantly in the embedded domain. Therefore, a significant effort in this project will be devoted to the identification and formulation of patterns representative of the industrial systems of our partners and to exploring task-parallel versions of these patterns. Application patterns allow us to capture the essential parallelization challenges while being free of the IPR and secrecy problems associated with production code. This work package consists of two sub-tasks: (1) the identification, characterization, documentation, and iterative refinement of domain-specific application patterns, and (2) iteratively refined task-based parallelizations of the identified application patterns using the demonstrator from WP2. Both tasks involve close collaboration between the academic and industrial partners. We aim to perform parallelization using the demonstrator directly at the sites of our industrial partners, to evaluate both the programmer productivity and the performance scalability aspects of the technology. An important aspect of this activity is that the knowledge transfer is bidirectional, since we expect both the programming model and the supporting tools to evolve in response to the experience gained in their use. The identification of patterns in different domains will directly benefit all partners, as insight into different patterns will increase the awareness of how applications can be parallelized.

Expected outcomes:
- A set of application patterns representing important software algorithms and designs for the industrial partners.
- Published reports describing and characterizing these application patterns from a parallelization perspective.
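Purely as an illustration of the simplest form an application pattern can take (a generic textbook kernel in the spirit of the Berkeley "structured grid" dwarf, not one of the industrial patterns to be identified here):

    // A 1D Jacobi-style smoothing step: the core computation of a very simple
    // "structured grid" pattern. Names and sizes are illustrative only.
    #include <cstddef>
    #include <vector>

    void smooth_step(const std::vector<double>& in, std::vector<double>& out) {
        // Each iteration reads only in[] and writes a distinct element of out[],
        // so the loop can be expressed as independent tasks (or as an OpenMP
        // parallel for, i.e., multi-way task creation) with unchanged
        // sequential semantics.
        for (std::size_t i = 1; i + 1 < in.size(); ++i)
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    }

    int main() {
        std::vector<double> a(1024, 1.0), b(1024, 0.0);
        smooth_step(a, b);
        return 0;
    }

An industrial pattern would typically be larger and combine several such computational and communication structures, but the parallelization questions it raises are of the same kind.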
Work Package 2: Demonstrator

The project will demonstrate the development of safe and efficient parallel software for future multicore platforms, based on project results, using application patterns representing high-value, high-impact systems of the project partners. The project demonstrator is a coherent system of methods, technologies, and tools emanating from the disciplinary work packages, together with coding guidelines evolving with the demonstrator, and is applied to application patterns from the systems work package:

- Application patterns
- Performance modeling tools
- Constraint-aware and distributed task scheduling algorithms
- Dynamic analysis tools
- Tools for safeness checking and safe parallelization
- Coding guidelines

The application of the demonstrator can be started at any phase of the development of a task-based parallel program. Assuming that we are given a sequential pattern, we can use dynamic analysis to identify potential sources of parallelism to be annotated with tasks according to the coding guidelines, then perform safeness checking of the resulting task-parallel pattern, analyze and predict performance on different numbers of cores, and finally verify the results using an execution platform based on scalable task scheduling.

The analysis results, together with the coding guidelines, will steer the programmer to more easily maintained code with higher performance. Debugging, although an important task, is not part of this project proposal. As we aim for safe parallelism, debugging the parallel program is fundamentally equivalent to debugging the underlying sequential program. The safe parallel programming model guarantees that no new bugs are introduced in the parallelization effort. (For debugging to scale, the debugging execution should be parallel, but with the same external behavior as sequential execution.) The demonstrator includes the development of a set of coding guidelines which, in conjunction with the tools and task schedulers, will lead to easy development of safe parallel programs. The demonstrator will be developed in an iterative process with two annual full integrations, providing more functionality and coverage with each generation.

Expected outcomes:
- A coherent set of software analyzers, task schedulers, run-time systems and coding guidelines.
- Status reports and manuals describing the demonstrator on a yearly basis.

Work Package 3: Performance Modeling Tools

The current state of the art for assessing performance and other characteristics of future multicore systems, and of multiprocessors in general, is to use simulation. A model of the system under study is designed, typically as a computer program, and a workload is run on the simulated system. In the most detailed simulation models, processor internals as well as the memory hierarchy and interconnect design are modeled in great detail, making the simulator exceptionally slow. Although it is accepted practice to relax the level of detail for processor internals, it has been shown that small changes in the model of the memory system may have significant effects on system behavior, resulting in the need for multiple simulations with small random variations in, e.g., main memory latency, further increasing simulation time. On the other hand, task-based parallelism has a simple performance model relating the expected execution time T(n) on n processors to the sequential execution time T(1) and the unboundedly parallel execution time T(∞): T(n) < T(∞) + T(1)/n (a small numerical illustration is given after this work package description). This model does not, however, take locality or memory hierarchy issues into account. Future-proof performance estimations must not rely on detailed simulation while still being able to take important architectural characteristics such as memory hierarchy, number of cores and locality into account. We therefore need modeling techniques that capture the inherent behavior of programs and which can extrapolate the information to the use of more cores. We propose to extend the current techniques and models to support the requested extrapolation capability. Besides statistical sampling of memory references from a sample execution, these models are expected to make use of data collected by the run-time scheduler (WP4), the dynamic analyzer (WP5) and the type-and-effect system (WP6).

Expected outcomes:
- Prototype tools extending StatCache-MP with predictive capabilities and low-overhead sampling.
- Published patents, reports and dissertations describing this work.
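The following small sketch (the figures are invented, purely for illustration) evaluates the bound quoted above for a task graph with total work T(1) = 10^9 time units and critical path T(∞) = 10^6 time units:

    // Illustrative only: evaluate the bound T(n) < T(inf) + T(1)/n
    // for invented values of total work T(1) and critical path T(inf).
    #include <cstdio>

    int main() {
        const double t1   = 1e9;  // total work (time on one core), hypothetical
        const double tinf = 1e6;  // critical path (time on unboundedly many cores), hypothetical
        for (int n : {1, 16, 64, 256, 1024}) {
            double bound = tinf + t1 / n;   // upper bound on T(n)
            std::printf("n = %4d  T(n) < %.3g  (speedup > %.1f)\n",
                        n, bound, t1 / bound);
        }
        return 0;
    }

As n grows, the bound becomes dominated by T(∞), which is one reason the proposal insists on fine-grained tasks; the bound also says nothing about locality or the memory hierarchy, which is precisely the gap this work package addresses.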
Work Package 4: Constraint-Aware and Distributed Task Scheduling

Almost all available work on task scheduling for multicore processors refers to static task scheduling, where tasks are mapped to processors (cores) at compile time or when the tasks are created. This is neither flexible enough nor capable of meeting the challenges we are facing. Examples of dynamic task scheduling are Cilk, Intel Threading Building Blocks and parts of OpenMP version 3.0. In these models, threads are used as workers that execute tasks from a pool of tasks. Typically one thread is started per core. In Cilk and TBB, the scheduling of tasks onto worker threads is distributed by means of a task-stealing algorithm where a worker thread steals tasks from another thread's task pool at specific thread synchronization points. In OpenMP, the algorithm is implementation dependent and more control over scheduling is available to the programmer if desired. The schedulers currently available for these systems do not scale well to future manycore systems. The Intel TBB scheduler is inherently coarse-grained as it is entirely invoked as a run-time library. In contrast, the Cilk and OpenMP schedulers can be implemented partly as a library routine and partly by controlling the output from the compiler. Still, the Cilk scheduler, which is a randomized task-stealing algorithm, is totally oblivious of locality and of the non-uniformity aspects we will experience in the future.
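For reference, the following heavily simplified sketch (our own illustration, not Cilk or TBB code) shows the randomized task-stealing pattern described above. The locality-obliviousness is visible directly: an idle worker picks its victim uniformly at random, with no regard to where the task's data reside.

    // Simplified randomized work stealing: each worker pops from its own deque
    // and, when empty, steals from a uniformly random victim.
    #include <atomic>
    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <random>
    #include <thread>
    #include <vector>

    struct Worker {
        std::deque<std::function<void()>> tasks;
        std::mutex lock;
    };

    int main() {
        const int nworkers = 4;
        const int ntasks = 64;
        std::vector<Worker> w(nworkers);
        std::atomic<int> remaining{ntasks};

        // Seed all tasks into worker 0's pool; stealing spreads them out.
        for (int i = 0; i < ntasks; ++i)
            w[0].tasks.push_back([&remaining] { --remaining; });

        auto worker_loop = [&](int self) {
            std::mt19937 rng(self);
            std::uniform_int_distribution<int> pick(0, nworkers - 1);
            int executed = 0;
            while (remaining.load() > 0) {
                std::function<void()> t;
                {   // try the local pool first (LIFO end)
                    std::lock_guard<std::mutex> g(w[self].lock);
                    if (!w[self].tasks.empty()) {
                        t = std::move(w[self].tasks.back());
                        w[self].tasks.pop_back();
                    }
                }
                if (!t) {   // steal from a random victim (FIFO end), locality-oblivious
                    int victim = pick(rng);
                    std::lock_guard<std::mutex> g(w[victim].lock);
                    if (!w[victim].tasks.empty()) {
                        t = std::move(w[victim].tasks.front());
                        w[victim].tasks.pop_front();
                    }
                }
                if (t) { t(); ++executed; }
            }
            std::printf("worker %d executed %d tasks\n", self, executed);
        };

        std::vector<std::thread> threads;
        for (int i = 0; i < nworkers; ++i) threads.emplace_back(worker_loop, i);
        for (auto& th : threads) th.join();
        return 0;
    }

The constraint-aware scheduler proposed in this work package would, in essence, replace the uniform random victim choice with decisions informed by data location, inter-core distance and the current power state.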

We will in this work package extend the current state of the art by introducing a scalable, constraint-aware task scheduler implemented in existing prototype systems for task-based parallelism. Some of the constraints and requirements on the task scheduler that we need to consider are:

- Tasks should be scheduled on cores where their data are located.
- Tasks should be scheduled on cores close to the cores where the tasks they need to communicate with are located.
- The scheduling algorithm must be distributed in order to scale with the increase in cores.
- The scheduler should interact with power and thermal management systems, which typically adapt the number of available cores to match power and temperature constraints.

A small, purely illustrative sketch of how such constraints could enter a placement decision is given after this work package description. One main goal of this work package is to define a task model that permits as fine-grained parallelism as possible. A scheduler entirely implemented in a run-time library may be used to schedule tasks with a granularity on the order of thousands of instructions. A compiler-supported scheduler may improve this to tasks with a granularity on the order of hundreds of instructions. Finally, we may envisage hardware-supported schedulers that can support a granularity as small as on the order of tens of instructions. Also, the scheduling model should be both flexible and predictive in order to achieve the best parallelism and remain possible to model.

This work will be performed in three steps. First, we will explore dynamic scheduling algorithms, taking application fingerprints as input and dynamically rescheduling tasks to cores of a non-uniform architecture while minimizing the performance impact of task interaction, shared cache usage and memory interface pressure. Fingerprints will be collected during short time batches and new scheduling decisions will be made in a semi-dynamic fashion to minimize the overhead. Several scheduling alternatives will be evaluated, such as fair-share, maximum throughput, and minimum off-chip bandwidth. This scheme will work well for scheduling coarse-grained tasks, but will not be a good fit for activities with a short execution time. One sub-goal is to develop algorithms capable of predicting such resource sharing between independent applications, for which a fingerprint has been captured in isolation. Oracle scheduling will be our second step towards a scheduler supporting more short-running activities. We will measure dynamic information about the activity at runtime and determine what would have been the best scheduling decision at dispatch time, had we known this information then. Based on this oracle information, scheduling alternatives similar to those mentioned above will be evaluated. Our third step is historic scheduling. The historic scheduling activity aims at developing heuristics and methods for predicting the use of resources shared by several concurrent activities based on past history. We will develop heuristics that allow us to predict good scheduling at dispatch time based on past history. For example: the last time this user started a job with these parameters, it resulted in a specific performance fingerprint; the scheduling decision at dispatch time is based on the assumption of a similar fingerprint this time. In a similar way, tasks created in a certain manner will be assumed to have a performance fingerprint similar to that of corresponding activities in the past.

Expected outcomes:
- Prototype constraint-aware and distributed schedulers incorporated in existing compilers and runtime systems, such as those for OpenMP (gcc version 4.4), Cilk-5 or Intel TBB.
- Published reports and dissertations. Patents, where applicable.
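To illustrate how the constraints above might enter a placement decision, here is a deliberately naive, entirely hypothetical sketch of a per-task cost function; the structures, the hop-distance model and the weights are invented for this illustration and are not a proposed algorithm.

    // Hypothetical: weigh data locality, communication locality and power state
    // when choosing a core for a task. Everything here is invented for the sketch.
    #include <cstdio>
    #include <cstdlib>
    #include <limits>
    #include <vector>

    struct Machine {
        int cores;
        std::vector<bool> powered;                 // cores enabled by power management
        int distance(int a, int b) const {         // abstract on-chip hop distance
            return std::abs(a - b);
        }
    };

    struct Task {
        int data_home;                             // core holding most of the task's data
        std::vector<int> communicates_with;        // cores running communicating tasks
    };

    // Pick the enabled core with the lowest combined cost. A real scheduler would
    // also have to be distributed (no global view) and react to thermal events.
    int place(const Machine& m, const Task& t) {
        int best = -1;
        double best_cost = std::numeric_limits<double>::max();
        for (int c = 0; c < m.cores; ++c) {
            if (!m.powered[c]) continue;                      // respect power constraints
            double cost = 2.0 * m.distance(c, t.data_home);   // weight on data locality
            for (int peer : t.communicates_with)
                cost += 1.0 * m.distance(c, peer);            // weight on communication
            if (cost < best_cost) { best_cost = cost; best = c; }
        }
        return best;
    }

    int main() {
        Machine m{8, std::vector<bool>(8, true)};
        Task t{2, {3, 6}};
        std::printf("place task on core %d\n", place(m, t));
        return 0;
    }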
Work Package 5: Dynamic Analysis

Dynamic off-line dependence analysis (DDA) is a powerful approach to finding useful potential parallelism in sequential parts of legacy code as well as to checking the safeness of task-parallel code. It is based on observing the dependencies that occur in a running program. This can be accomplished by instrumenting the program itself or by running it under an instrumented emulator (our current prototype uses Valgrind). The dependencies are collected in dedicated analysis runs, so the overhead does not affect the performance of production runs. Since DDA is based on execution, it is highly language independent and applicable to programs written in a combination of languages and where source code is only available for parts of the program. It is also exact, in contrast to static analysis, which must always be conservative, so it can be used to find places where static analysis over-approximates the dependencies. DDA is applied either to sequential parts of legacy code with the aim of uncovering potential parallelism, or to the sequential reading of task-parallel code to ensure safeness. Since DDA observes particular executions, its results are not guaranteed to hold for all possible executions. However, preliminary results indicate that selecting inputs that cause the entire program to be executed typically reveals all dependencies. If the analyzed program is deterministic, dependencies can only be missed due to an insufficient set of test inputs, and the problem can immediately be identified by rerunning the analysis with the offending input. Thus, with DDA, all debugging takes place in a sequential setting.
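The following toy sketch (our own illustration, not the Valgrind-based prototype) shows the core bookkeeping of such an analysis: a shadow map remembers which code region, e.g., a candidate task, last wrote each address, and any later access from a different region records a dependence.

    // Toy dynamic dependence recorder. Real tools obtain the memory events from
    // instrumentation; here they are fed in by hand. This version detects true
    // (read-after-write) and output (write-after-write) dependencies; tracking
    // anti-dependencies would need a last-readers map, omitted for brevity.
    #include <cstdint>
    #include <cstdio>
    #include <set>
    #include <unordered_map>
    #include <utility>

    struct DepObserver {
        std::unordered_map<std::uintptr_t, int> last_writer;  // address -> region id
        std::set<std::pair<int, int>> deps;                   // (earlier region, later region)

        void on_read(int region, std::uintptr_t addr) { record(region, addr); }
        void on_write(int region, std::uintptr_t addr) {
            record(region, addr);
            last_writer[addr] = region;
        }

    private:
        void record(int region, std::uintptr_t addr) {
            auto it = last_writer.find(addr);
            if (it != last_writer.end() && it->second != region)
                deps.insert({it->second, region});            // region depends on earlier writer
        }
    };

    int main() {
        DepObserver o;
        o.on_write(1, 0x1000);   // region 1 produces a value
        o.on_read(2, 0x1000);    // region 2 consumes it: dependence 1 -> 2
        o.on_read(3, 0x2000);    // address never written: no dependence
        for (const auto& d : o.deps)
            std::printf("region %d -> region %d\n", d.first, d.second);
        return 0;
    }

Regions that end up with no dependence edge between them are candidates for parallel tasks; conversely, running the sequential reading of an already task-annotated program through such an observer checks the safeness condition.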

In this work package we will investigate DDA, producing a complete tool applicable to industrial software. Issues to tackle include: DDA for multi-threaded code (with the objective of parallelizing individual threads); constructing models for predicting the confidence with which the absence of dependencies observed for the test inputs can be generalized to all inputs; combining the instrumented approach (which collects all dependencies) with hardware-based statistical sampling techniques with several orders of magnitude lower overhead; and the extension to a parallelization support tool giving advice on the most profitable parallelization opportunities.

Expected outcomes:
- Tools for parallelization support and safeness checking based on dynamic dependence analysis.
- Published reports and dissertations.

Work Package 6: Type and Effect Systems for Safety

We will here develop theory, methods, and tools for verifying the safety of a task-based parallel program. A safe task-based parallel program is guaranteed to have the same semantics as the underlying sequential program. This will eliminate parallel programming as an additional source of race-condition software defects, with great impact on quality and productivity. Our approach is to extend the state of the art in type and effect systems to express and derive constraints on the allowed data dependencies between tasks. This is analogous to data type systems, which constrain the possible values of variables. Where data types are associated with variables, effect types capture the side effects that may arise from the execution of a statement. Effects are reads and writes to regions, which represent a static division of the address space into disjoint subsets. Thus, if a task writes to a region r, then no parallel task may either read or write the same region r. We believe that formulating static dependence analysis as a type system has several advantages. Type systems have strong compositional properties: the information about a program fragment is represented in its (annotated) type, and once that type has been determined, the fragment itself need not be analyzed further. This property enables seamless support for inter-procedural analysis, even in the presence of first-class procedures, as well as modular analysis. In contrast, non-type-based static analysis often requires the availability of the entire source code. Programs need not contain explicit type information, since a type inference algorithm can reconstruct the types that a program must have in order to be correct. Static analysis strikes a balance between precision and performance. Traditionally, in automatic parallelization, the designer of the analyzer makes that tradeoff. One of the attractive features of type inference in the context of safe parallelism is that a parallel programmer can use type annotations to guide the analyzer. Type annotations can be arbitrarily precise while still being checkable by the analyzer. Thus, safety can be verified for parallel programs that would never have been generated by an automatic parallelizer.
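As an illustration of the region and effect idea (the annotations below are written as comments in purely hypothetical notation; finding concrete constructs is part of the research), consider two tasks operating on disjoint regions of an array:

    // Hypothetical effect annotations, written here as comments on ordinary C++.
    // Regions r1 and r2 partition the address space occupied by the two halves
    // of the array; the effect of each task is the set of regions it reads or writes.
    #include <vector>

    void scale(std::vector<double>& a, std::size_t lo, std::size_t hi, double k) {
        for (std::size_t i = lo; i < hi; ++i) a[i] *= k;
    }

    void scale_both_halves(std::vector<double>& a, double k) {
        std::size_t mid = a.size() / 2;
        // task 1: effect = { writes r1 }  where r1 = a[0 .. mid)
        // task 2: effect = { writes r2 }  where r2 = a[mid .. size)
        // Since no region is written by one task and touched by another, the
        // type and effect system can accept the two tasks as safe to run in
        // parallel. If task 2 instead also read a[0], its effect would include
        // { reads r1 } and the overlap on r1 would be rejected (or serialized).
        scale(a, 0, mid, k);
        scale(a, mid, a.size(), k);
    }

    int main() {
        std::vector<double> v(8, 1.0);
        scale_both_halves(v, 2.0);
        return 0;
    }

Because effects become part of the (annotated) type of a procedure such as scale, once that type has been determined the procedure body need not be re-analyzed at each call site, which is the compositionality argument made above.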
In this work we will formalize and investigate the theoretical properties of a series of type systems allowing progressively more powerful parallel patterns to be type checked, an activity that will advance the state of the art in polymorphism, effect types and dependent types. We will also implement type inference tools that will automatically check program safety. The theoretical results will be applicable to existing explicit parallel programming environments such as X10, OpenMP or Cilk, and the tools will work with at least one of these languages.

Expected outcomes:
- Prototype tools for parallelization support and safeness checking based on effect typing.
- Published reports and dissertations.

Contribution to competitiveness

The multicore revolution is forcing a paradigm shift onto the software industry. No longer will increasing clock frequency in new generations of microprocessor hardware drive a commensurate increase in the performance of sequential programs. To leverage the performance increases of future hardware generations, all software must be parallel. To be future-proof, the parallel performance must scale with the number of cores. To be competitive, the added requirement of parallelism must not hamper engineering efficiency.

This affects the systems industry dramatically since it is highly software-intensive. For instance, Ericsson's CEO recently described the company as the world's fifth largest software company. About 80% of the development costs can be attributed to software. Software represents over 50% of development costs in several other companies. At the same time, only 1% of developers are familiar with parallel programming, and the development cost of parallel software is 2-3 times higher than for normal software. Mastering this grand challenge will require a paradigm shift for software development, in Europe as in the rest of the world. A race has begun for new methods, technologies, and tools. This situation has been recognized all over the world and has motivated several research initiatives. In particular, the Computing Systems call in the EU Seventh Framework Programme addresses this issue, as does the ARTEMIS technology platform. In the US, the computer industry has formed strategic partnerships with academic institutions, with the expectation of obtaining a competitive advantage. Microsoft and Intel are spending $20M over five years at the University of California at Berkeley and the University of Illinois at Urbana-Champaign. AMD, HP, Intel, NVidia, and Sun are spending $6M over three years at Stanford University. There appears to be no simple technical fix for the problem. While techniques such as automatic parallelization, transactional memory and thread-level speculation are attracting interest, we believe that they are short-term solutions that do not scale to hundreds or thousands of cores. This level of parallelism must be designed into the code from the start, which necessitates the evolution of a programming model that forms a new division of responsibility between programmers and machines. Task parallelism is emerging as one of the main contenders for the future of parallel programming and is at the heart of several important technologies, such as OpenMP 3.0, Cilk, Intel TBB, and recently also the .NET platform. We believe that our proposal adds the following unique features:

1. Dynamic dependence analysis is a language-independent method for discovering potential and profitable parallelism in new and legacy code.
2. The use of type and effect systems for the vital role of ensuring safeness has the attractive property that types in most cases can be derived automatically, without any type annotations in the program, while still allowing type annotations as a way to guide the safeness checker in especially tricky situations.
3. User-level schedulers adapting to run-time locality information have the potential to efficiently exploit inter-task locality, which is crucial since tasks are too fine-grained to have enough internal locality.
4. Task-based modeling of memory system performance will improve the predictive power of the current models, which only take the number of operations and the length of the critical path into account.

In a recent list of priorities of parallel programmers collected by Petersen, Robinson, Leasure and Mattson, the first three items were:

1. Finding concurrent tasks in a program, both for legacy code and for code written from scratch.
2. Scheduling tasks at the right granularity onto the processors of a parallel machine.
3. The data locality problem: associating data with tasks.

This proposal directly addresses these top-priority tasks.
Potential Consortium

A proposal for a consortium could include the following identified partners:

- KTH (Royal Institute of Technology): manycore architectures, operating system scheduling algorithms for hardware resources as well as threads, parallel programming models
- SICS (Swedish Institute of Computer Science): tools for data dependence analysis, run-time systems for task-based parallelism, type and effect systems
- Ericsson AB: application provider and software developer

Your expertise here?