A Scalable Control and Monitoring Framework to Aid the Development of Supercomputer Applications


Gregory R. Watson, IBM Systems & Technology Group
Carsten Karbach, Forschungszentrum Jülich GmbH
Wolfgang Frings, Forschungszentrum Jülich GmbH
Albert L. Rossi, Fermi National Accelerator Laboratory
Claudia Knobloch, Forschungszentrum Jülich GmbH

ABSTRACT

The development of scientific applications for parallel computing systems is becoming increasingly challenging. Petascale systems are now becoming readily available to the scientific computing community, and planning is underway to achieve exascale within the next decade. The vast power of these systems, coupled with a corresponding increase in application code complexity, is making the limitations of existing programming and performance tools ever more apparent. If developers are going to be able to effectively utilize these systems, then a new generation of tools will be required that seamlessly integrate with each other and with the target systems on which they operate. The open source Parallel Tools Platform (PTP) Project was established in 2005 to create a best-practice integrated tool workbench designed to increase the productivity of parallel application development. PTP has increased in popularity over the years, and is now used by a growing community of developers in scientific and engineering fields. PTP must also adapt to the new petascale and exascale environments, however, and in this paper we describe some of the recent changes to the PTP core infrastructure that will enable it to work effectively with these and future generations of high performance computing systems.

1. INTRODUCTION

Recent announcements have heralded a new generation of petascale systems, including most recently the National Center for Supercomputing Applications (NCSA) Blue Waters system and the National Center for Atmospheric Research (NCAR) Yellowstone machine.
The top 10 systems on the November 2011 TOP500 list all now exceed one petaflop peak performance. The drivers for this massive increase in computational power are the large and complex applications now being used to perform some of the most detailed and accurate numerical simulations ever contemplated. For these applications to be successful, the utmost level of performance must be extracted from the hardware; it is no use having a 10 petaflop system if only a fraction of the resource can actually be used. Unfortunately, the complexity of both the applications and the computer systems they run on is also stretching the limits of existing programming and performance tools. If developers are going to be able to effectively utilize these systems, then a new generation of tools will be required that provide significantly improved capability over the current ones.

In 2005, the Parallel Tools Platform (PTP) project was established in order to advance the state of parallel application development and provide an integrating framework for the development of parallel tools. As PTP has gained in popularity, and is now being contemplated as the development environment for systems such as Blue Waters, it is becoming increasingly important for the platform to be able to support these petascale systems.

In this paper, we will present our recent work to improve the scalability and usability of PTP. In Section 2 we will discuss the motivation for the changes in more detail.
In Section 3 we will present the overall architecture of PTP and the components on which the current work is concentrating. In Section 4 we will present improvements to the scalability of the platform, and in Section 5 we will provide details of how the environment is now significantly more extensible. Section 6 proposes some areas of future work, and Section 7 concludes the paper.

2. MOTIVATION

Many tools are available to aid developers of HPC applications, ranging from compilers to build systems to performance analysis and tuning tools. However, most of these either provide stand-alone GUIs or are command-line tools, which complicates the developer's work for a number of reasons. First, the developer must spend considerable time understanding and learning the different tool interfaces. Second, the tools typically do not share information, so the developer must set up and configure the tools with the same information multiple times. Finally, the developer's workflow is encumbered by the need to manually switch between tools in order to access the desired functionality. Integrated development environments (IDEs) have long been used to overcome all these issues (and more), and are best practice for most of the software engineering industry. Strangely, HPC is one of the few disciplines that has not accepted the productivity benefits that IDEs have been shown to deliver, although this is now changing.

A number of past efforts have attempted to create integrated environments for developing parallel programs [1] [2] [3], but few, if any, of these survive today. In addition, there are a variety of tools available for monitoring job and system status on large high performance computing systems [4] [5], and some batch systems provide facilities for remote job submission and monitoring. To the authors' knowledge, PTP is unique in that it provides integration not only of a broad range of development tools, but also with the systems themselves, allowing the developer to submit jobs and monitor activity on one or more target systems from within the development environment. This ability enables developers to streamline their development workflow so that they can avoid time-consuming and costly context switches between different tools, something that is essential for increasing developer productivity.

The PTP project brings together a range of tools for developing C, C++, Fortran and UPC applications into a single integrated environment. In addition to advanced editing, project management, and integration with version control systems (CVS, Subversion, and Git), there is also support for MPI development, and an integrated parallel debugger [6]. A number of performance tools have also been integrated⁴.
When PTP was first developed, even the largest systems were relatively small compared to today's machines⁵, and the ability to monitor the entire system and the user's jobs was relatively straightforward. Advances in system size have led us to make changes to the PTP core infrastructure in order to improve the scalability of system and job monitoring and to simplify the process of adding support for new systems and job schedulers.

For scalable monitoring, we have based the implementation on an existing batch system monitoring tool called LLview [7] that is known to be highly scalable. This enables information about the user's job execution, along with a full overview of the system, to be viewed at regular intervals. Such a live view enables greater awareness of the target system, its complexity, and the circumstances under which the user's jobs are running. The monitoring component is able to show the current usage of the full system, including the mapping of jobs to compute nodes and the load of the batch queues. We plan to extend this in forthcoming versions to display a prediction of the future system usage based on the current state. This would provide the user with, for example, more detailed reasons as to why a job has not yet been started by the batch system.

For extensibility, we have designed a completely generic framework for controlling job submission to a target system, whether interactive or batch. Support for a new type of job scheduler, for example, can be completely specified via an XML definition file. This specification includes commands to run on the target system to perform job related activities (e.g. submission, termination, etc.) as well as information on how to lay out the user interface so that users can supply resource related information required by the target system.

⁴ University of Oregon's TAU, IBM's HPC Toolkit, and others.
⁵ At the time, the world's largest system was a 1024 node cluster at Los Alamos National Laboratory.

3. ARCHITECTURE

Even in modest supercomputing installations, the computing resources routinely used by developers are scarce and must be shared by large numbers of users. Such systems are usually centrally located in specialized facilities, and must be accessed remotely. They are also typically highly customized systems, and rarely employ the same system software or development tools. PTP is a set of plug-ins for the Eclipse Platform that extends its functionality to provide various features for assisting HPC application developers in these types of environments, particularly those using MPI and other parallel programming models.

3.1 Control and Monitoring Frameworks

PTP provides developers with a variety of techniques for simplifying the way in which remote computing resources are accessed and utilized, and these have been discussed in detail elsewhere [8]. The two key components of PTP that are the focus of this paper are the control and monitoring frameworks. These are the primary mechanisms for hiding the intricacies of the complicated system software used on HPC machines, and are the means by which PTP users launch and debug jobs, and monitor activity on a target system.

The control framework provides an abstraction of a batch scheduler, interactive runtime system, or some other means of controlling jobs on a system. The monitoring framework provides a mechanism for monitoring the system and job-related activity on a target machine. The control and monitoring frameworks operate completely independently; however, it is also possible to link a control and a monitor implementation so they can be used together when appropriate. PTP also allows multiple control and monitoring systems to be defined, each able to interact with different types of systems simultaneously, even if these are on completely independent machines.

Figure 1 shows a high-level architecture of PTP.
On the left, the Eclipse platform provides the user's main development environment, and acts as the client for developing HPC applications and accessing supercomputing resources; typically this client runs on the user's workstation or laptop.

The control framework issues commands to submit new jobs and perform operations on existing jobs (such as cancelling a job). The control framework is also responsible for handling standard output generated by batch or interactive jobs and standard input to interactive jobs, as well as for initiating debug sessions. Launching a job via the control framework uses the normal Eclipse launch configuration mechanism.

The monitoring framework manages a content model as well as views of system and job information, and is responsible for collecting data from the batch system or interactive runtime system (or both, depending on the configuration of the target system) and presenting this information to the user. Communication with the target system for both control and monitoring is via a single SSH connection (raw TCP/IP sockets can also be used).

Figure 1: High-level architecture of PTP. On the left is the Eclipse client that normally runs on the user's workstation or laptop. On the right is the supercomputing resource that the user is developing and running applications on. Interaction between the client and the target system is required for launching, controlling, and monitoring applications. An agent is used on the target system to manage the formatting of monitoring data.

3.2 Model Driven Architecture

Both the control and monitoring frameworks use XML data formats to drive the user interface and other functions. The control component uses XML for its definition files, which provide information about the target system, such as the type of batch system, commands for job submission and control, as well as the layout of the launch configuration user interface. The monitor component uses XML for communicating monitoring and layout information between the Eclipse client and the target system.

PTP uses the Java Architecture for XML Binding (JAXB) to map XML information directly to Java classes so that each XML element has a corresponding Java representation. When the framework needs to access a configuration file, or receives an XML formatted message, the XML is unmarshalled and merged into the internal content model. Various parts of the Eclipse user interface, including the launch configuration and the monitoring and status views, are driven directly from this content model.

4. SCALABILITY

Adapting PTP to systems at petascale and beyond requires careful consideration of all aspects of the interaction between the local Eclipse client and the target system on which the user's jobs will be running. This is especially the case for system monitoring, which must process and present large amounts of monitoring data generated by these systems.
Unless extreme care is taken, it is very easy to overwhelm the user by presenting too much information, or to run into speed or memory constraints when trying to process the information within Eclipse. The scalability of PTP's architecture was demonstrated during SC11 in November 2011 by simulating a full scale BG/Q system of approximately 1.6M cores. In the following sections, we will discuss the scalability features of the monitoring component in more detail, concentrating on three main areas for increasing scalability: the data representation, the user interface components, and remote data acquisition.

4.1 Monitor Data Representation

PTP's monitoring framework uses a client-server model. The server is responsible for collecting information about the target system and the jobs on the system. This information is then passed to the client, where it is presented to the user via the Eclipse user interface.

We have designed the Large-scale system Markup Language (LML), which is an XML schema that defines the structure of this monitoring data [9]. LML can be used to describe the status of arbitrarily large computer systems; there are no restrictions on the system's architecture or size. LML is designed so that one instance⁶ provides a snapshot of the current system's state. It consists of a set of independently presentable graphical objects specified by simple elements such as table, textbox, and diagram, and by more complex elements such as nodedisplay. The server generates these elements along with elements containing the data that is to be displayed.

The nodedisplay element is the most important part of LML for providing an overview of the system's state. It presents a graphical view of the system and displays the physical location of jobs currently running on the system. The nodedisplay element contains two children: scheme and data.
The scheme element defines the physical hierarchy of the target system, while the data element associates physical components with dynamic aspects of the system, such as the nodes on which jobs are running.

Listing 1 shows an example of a nodedisplay element used to represent a system like the Jülich Blue Gene/P. The scheme element is used to define a system comprising 72 racks, where each rack has 32 node cards and a node card has 32 chips, each of which contains 4 cores. Following this is a data element that uses the same hierarchy defined by the scheme element for addressing physical components. Every el element (el1, el2, etc.) within the data element has an oid attribute, which is used to reference other elements in the LML data (such as a user's job on the system). The oid attribute is also inherited by children of an element, which eliminates any redundancy in the data. Elements with identical oid attributes can be specified more compactly using ranges.

⁶ We use instance to refer to an XML document that is valid against the LML schema.

Listing 1: LML nodedisplay example

    <nodedisplay title="Jugene" id="nd">
      <!-- Physical system structure -->
      <scheme>
        <el1 tagname="rack" min="1" max="72">
          <el2 tagname="nodecard" min="1" max="32">
            <el3 tagname="chip" min="1" max="32">
              <el4 tagname="cpu" min="1" max="4"/>
            </el3>
          </el2>
        </el1>
      </scheme>
      <!-- Connect physical elements to current jobs -->
      <data>
        <el1 min="1" max="36" oid="j1" status="running">
          <el2 min="9" oid="j2"/>
        </el1>
        <el1 min="37" max="72" oid="empty" status="idle" description="racks broken"/>
      </data>
    </nodedisplay>

Using this approach, it is possible to represent system information to any level of detail required, or to vary the level of detail used to represent different parts of the system. By eliminating parts of the hierarchy that are not required to be displayed, it is very easy to reduce the volume of information transmitted to the client. Another scalability technique we employ is to avoid repetitious data, which can be collected together using special elements (e.g. job names and the colors used to identify the jobs in the view).

It is also possible to display the physical structure defined by the data element at different levels of detail. The tree described by the data element can be eliminated from the client's view at any level. This reduces the amount of detail shown and hence the view's complexity.
However, just eliminating the lower levels of the tree results in a loss of data for the user. To avoid this, we provide a mechanism that summarizes the lower levels into a flat data structure, known as a usage bar, that is still a valid representation of the child elements. A usage bar is defined as a map whose keys are job references and whose values are the amounts of the smallest units defined by the scheme element (which is cpu in the above example). This map can be generated for each data element in the nodedisplay by calculating the total number of leaf elements and the number of leaves assigned to each job. A usage bar disregards the connection between jobs and corresponding compute resources, but ensures job information is still presented regardless of which level of detail is shown.

LML also provides the table element, which is used to represent tabular data such as jobs running on the system. The first part of the table element comprises column elements that specify information about each column in the table. Following the column definitions are row and cell elements that specify the contents of the table. In order to reduce the amount of data transmitted, column elements can contain pattern elements which specify the condition under which row data will be included in the table. For example, if users only want to see their own jobs, a pattern would be added to an owner column specifying a user name to match. Only rows containing this user name would be included in the table data sent to the client.

4.2 User Interface

Once the client acquires an LML instance from the server, it must be rendered in the Eclipse user interface so it can be viewed by the user. In addition, the user interface must react to user input such that associated information across the different LML components is visually emphasized. For example, nodes on which a job is running are highlighted when the user selects the corresponding line in a table of job information.
Users can also customize the client view by hiding, positioning, and scaling graphical components individually. As a result, one LML instance can lead to a number of different client views.

4.2.1 Nodes View

The primary user interface component is the view in which the nodedisplay element is shown. Each of the physical elements specified by the scheme element is rendered as a rectangle, with children painted recursively within this rectangle. The data elements are then expanded and elements on the lowest levels are filled with colors to identify the jobs running on them. Figure 2 shows how this hierarchical arrangement is used to display a full scale (96 rack) Blue Gene/Q down to the node card level. Because both the client and server sides (as discussed in more detail below) allow the level of detail to be defined, it is possible to display systems of virtually any size.

The view also allows the user to zoom into physical elements to see more details about the subtree (also shown in Figure 2). This allows a high-level view to be used to avoid overwhelming the user with information, while allowing the user to exploit the detailed information available in the LML instance. A usage bar can also be generated for each data element summarizing the content of its subtree. If the view is collapsed to a lower level of detail, usage bars are painted into the rectangles instead of just filling them with a single color.

Currently the view presents only a single node of the data tree. This node is usually the root, but can be altered by zooming into subtrees. We plan to extend this to allow an arbitrary number of trees to be displayed in the view. This would be useful for displaying the first five racks of a system, for example.

4.2.2 Table View

A table view is used to render additional information provided by the target system, such as the list of queued or active jobs.
Once again, care must be taken not to overwhelm the user with information, as the size of these tables tends to increase with the size of the system, and more information about jobs and physical elements is transmitted. To keep tables manageable, the user is able to sort table data, hide columns, and take advantage of mouse interaction to visually connect displayed information across all graphical components.
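As an illustration of the usage bars described above for the nodedisplay element, the following Python sketch computes the job-to-leaf-count map for the data tree of Listing 1. It is not taken from PTP or LLview; the function names (`level_sizes`, `usage_bar`) are hypothetical, and it assumes that an element without a max attribute covers a single index and that sibling ranges do not overlap.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from math import prod

# Cleaned-up copy of the nodedisplay element from Listing 1.
NODEDISPLAY = """
<nodedisplay title="Jugene" id="nd">
  <scheme>
    <el1 tagname="rack" min="1" max="72">
      <el2 tagname="nodecard" min="1" max="32">
        <el3 tagname="chip" min="1" max="32">
          <el4 tagname="cpu" min="1" max="4"/>
        </el3>
      </el2>
    </el1>
  </scheme>
  <data>
    <el1 min="1" max="36" oid="j1" status="running">
      <el2 min="9" oid="j2"/>
    </el1>
    <el1 min="37" max="72" oid="empty" status="idle"/>
  </data>
</nodedisplay>
"""


def level_sizes(scheme):
    """Return the number of elements per level, e.g. [72, 32, 32, 4]."""
    sizes = []
    node = scheme
    while len(node):          # descend through el1, el2, ... until the leaf level
        node = node[0]
        sizes.append(int(node.get("max")) - int(node.get("min")) + 1)
    return sizes


def usage_bar(nodedisplay_xml):
    """Map each job reference (oid) to the number of leaf units it occupies."""
    root = ET.fromstring(nodedisplay_xml)
    sizes = level_sizes(root.find("scheme"))

    def leaves_below(depth):
        # Leaf units contained in ONE element at the given depth.
        return prod(sizes[depth + 1:])

    def bar_for(el, depth, inherited_oid):
        oid = el.get("oid", inherited_oid)   # oid is inherited from the parent
        lo = int(el.get("min"))
        hi = int(el.get("max", lo))          # assumption: no max means one index
        span = hi - lo + 1
        one = Counter()                      # usage of a single element in the range
        if depth == len(sizes) - 1:
            one[oid] = 1
        else:
            covered = 0
            for child in el:
                child_bar, child_span = bar_for(child, depth + 1, oid)
                one.update(child_bar)
                covered += child_span
            remaining = sizes[depth + 1] - covered
            if remaining:                    # children not listed keep the parent's oid
                one[oid] += remaining * leaves_below(depth + 1)
        return Counter({k: v * span for k, v in one.items()}), span

    total = Counter()
    for el in root.find("data"):
        total.update(bar_for(el, 0, None)[0])
    return total
```

For the data in Listing 1, `usage_bar(NODEDISPLAY)` yields 142848 cpus for j1, 4608 for j2, and 147456 for empty, which together account for all 72 x 32 x 32 x 4 = 294912 cpus. The same function could be applied to any subtree to obtain the bar painted into a collapsed rectangle.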

Figure 2: Screenshot of a full scale 96 rack IBM Blue Gene/Q simulator (1.6M cores). The left hand side of the display shows an overview of the entire system comprising 12 rows of racks (row 12 is scrolled off the screen). Each row comprises 8 racks containing 2 mid-planes, and each mid-plane contains 16 node cards. The middle image shows the display zoomed into one row, and the right image shows the display zoomed into a single rack.

4.3 Data Acquisition

The acquisition of monitoring data is also an important scaling issue, because obtaining the full system state of a large parallel system may require significant resources. Although LML provides a scalable data format to store information about the system components using a hierarchical structure, this is generally not the case for resource management systems such as batch schedulers. Most of these systems only provide a flat data representation of the system, for example as a list of nodes and associated state information. As a consequence, a full system query could comprise a huge number of elements, leading to long query times and large amounts of data. Moreover, large numbers of users performing such queries frequently and simultaneously would place an unacceptable burden on the resource management system.

For full system monitoring, as is implemented in PTP, the user experience can be improved by mapping various attribute values to hardware components. For example, the identifier of a batch job can be mapped to the node that it is running on in order to give a visual indication of the utilization of the system. In general, the full system view has to represent one or more N-to-M mappings of attributes to components. However, at least one side of these relations has to be minimized, otherwise scalability becomes an issue. To address this, LML provides the hierarchical tree model in the nodedisplay element, which can be directly exploited when generating the mapping information.
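One simple way to reduce a flat node list before it reaches the client is to merge runs of adjacent components that share the same job reference into min/max ranges, as the nodedisplay data element permits. The Python sketch below illustrates that idea only; it is not the PTP or LLview implementation, and the function names and the choice of el1 as the output level are assumptions made for the example.

```python
def compact_ranges(node_jobs):
    """Collapse a flat, ordered list of job references (one per component,
    numbered from 1) into (min, max, oid) ranges of identical neighbours."""
    ranges = []
    for index, oid in enumerate(node_jobs, start=1):
        if ranges and ranges[-1][2] == oid:
            lo, _, _ = ranges[-1]
            ranges[-1] = (lo, index, oid)      # extend the current run
        else:
            ranges.append((index, index, oid)) # start a new run
    return ranges


def to_lml(ranges, tag="el1"):
    """Render the ranges as data elements (hypothetical top-level tag)."""
    return [f'<{tag} min="{lo}" max="{hi}" oid="{oid}"/>' for lo, hi, oid in ranges]


# Hypothetical flat query result: six components and the job on each of them.
flat = ["j1", "j1", "j1", "empty", "empty", "j1"]
for line in to_lml(compact_ranges(flat)):
    print(line)
```

Six flat entries become three range elements here; on a real system the saving depends on how contiguously jobs are placed.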
In systems that have logical or physical hierarchies of components (e.g. Blue Gene/Q has partition configurations that can be described as sets of base partitions, which are typically mid-planes or node cards), this information can be used directly to generate the mapping to inner tree nodes of the nodedisplay element. For resource management systems which do not provide such a hierarchical representation, the queries have to be optimized in another way. The key to minimizing scalability problems when transitioning from a flat structure to the tree model is to perform the mapping as early as possible, and at the highest level of abstraction. For a system such as Blue Gene/Q this would be the mid-plane (or node card), while for traditional clusters this might be compute nodes consisting of several processors or cores.

The acquisition of monitoring data on the remote system is performed by a set of Perl scripts that use the standard batch system query functions to obtain data about jobs, nodes, and other useful status information. To allow more flexibility, functions which are related to a particular batch system are separated into a driver layer. Typically these functions are realized by individual small scripts querying and generating LML code for one type of information (e.g. jobs). This simplifies the process of adapting to a different or newer version of a batch system. The scripts also provide mapping tables from batch-system-specific attribute names to an attribute naming scheme defined by LML.

The LML data generated by these scripts is combined into an LML intermediate format, containing only a list of objects and, for each object, a list of corresponding attributes. When a client requests monitoring information, the request information and intermediate format are used to generate LML data containing the appropriate elements (e.g. table, nodedisplay, diagram, etc.) for displaying the data to the user. Storing the monitoring data in an intermediate format provides a number of advantages. In particular, other tools can generate data in intermediate format, and this can be merged with the monitoring data to enhance the utility of the data. We plan to provide such a tool in the future, which simulates system usage based on the current usage and job load on the system. When merged with the monitoring data, this adds new attributes that show predicted start time, ending time, and the nodes on which a job will run.

4.4 Scaling Results

In order to demonstrate the scalability of the monitoring system, we used the results from a 96 rack Blue Gene/Q simulator that was demonstrated at SC11. This system is equivalent to the LLNL Sequoia system that recently became #1 on the TOP500 list. At the node card level of detail (as shown in Figure 2), the update time, including collection of the data from the target system, was less than 10 seconds, which we consider well within acceptable refresh times. The system has also recently been monitored to node-level detail, with similar results.

We have also run a number of tests on a variety of XSEDE systems, including the National Institute for Computational Sciences (NICS) Kraken and Keeneland systems, and the Texas Advanced Computing Center's (TACC) Lonestar and Ranger systems, as well as Argonne National Laboratory's Blue Gene/P and Q. Monitoring of all these systems was within acceptable times.

5. EXTENSIBILITY

Eclipse provides a standard framework for launching applications, and PTP uses this mechanism to support launching applications via the control framework. The Eclipse launch framework allows the user to create a launch configuration, which encapsulates all the information necessary for the application to be successfully launched, such as the location of the executable and any required arguments, and to enter this information via a user interface. Once the launch is configured, a Run button is selected, and the appropriate actions will be taken to launch the application.
Launch configurations are persisted across Eclipse sessions, so once created they can be reused for future job launches.

For launching jobs via PTP's control framework, a launch configuration specifies the type of batch system being used on the target machine. The user does this by choosing the batch or runtime system type from a list, and then providing some additional connection and authentication information. This normally only needs to be done once.

Once the job is submitted, users may receive an indication of the job status (queued, running, etc.) via a view in the user interface, where they can see all the jobs they have submitted. The same interface is also used for controlling the job. If supported by the target system, the output from the job can be viewed directly from within the Eclipse user interface in a console view. In the case of interactive jobs, this output is displayed immediately as the job begins to execute. For batch jobs, the output is generally available once the job has completed.

5.1 Definition File Format

A key feature of the control framework is that all the launch information required to interact with the batch system is contained within a single XML definition file. Users are able to import definition files into their workspace in order to add support for additional systems. This conveniently overcomes the existing limitation where the set of supported batch or interactive systems is fixed for each PTP release, and also allows system administrators to make site-customized definition files available to their users.

The XML definition file specifies how the resource manager will interact with the system, and how information obtained from the batch system will be presented to the user in order to successfully launch a job. A definition file schema describes the format of this configuration file and defines the following main types of elements: attribute, command, and launch-tab.
Attributes are used to represent information that is passed between the user interface and the target system. Commands specify how jobs are to be launched and controlled, and how job status information is to be obtained. Launch tabs are used to define the user interface for entering job specific information. Listing 2 shows a section of the definition file for the PBS resource manager.

Listing 2: Example definition file

    <resource-manager-builder name="pbs-torque-generic">
      <control-data>
        <attribute name="queues" visible="false"/>
        <attribute name="destination" type="string">
          <description>Designation of the queue to which to submit the job.</description>
          <tooltip>Format: queue ].</tooltip>
          <default>debug</default>
        </attribute>
        <start-up-command name="get-queues">
          <arg>qstat</arg>
          <arg>-Q</arg>
          <arg>-f</arg>
          <stdout-parser delim="\n">
            <target ref="queues">
              <match>
                <expression>Queue: ([\w\d]+)</expression>
                <add field="value">
                  <entry valuegroup="1"/>
                </add>
              </match>
            </target>
          </stdout-parser>
        </start-up-command>
        <launch-tab>
          <basic>
            <title>Basic PBS Settings</title>
            <composite group="true">
              <widget type="combo" style="SWT.BORDER" readonly="true" savevalueto="destination">
                <layout-data>
                  <grid-data horizontalalign="SWT.FILL" horizontalspan="2" grabexcesshorizontal="false"/>
                </layout-data>
                <items-from>queues</items-from>
              </widget>
            </composite>
          </basic>
        </launch-tab>
      </control-data>
    </resource-manager-builder>
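To make the role of the stdout-parser in Listing 2 concrete, the following Python sketch mimics what the get-queues command appears to do: split the command output on the delimiter, apply the expression to each segment, and add capture group 1 to the queues attribute. This is an interpretation of the listing rather than PTP's Java implementation; the function name is hypothetical, matching from the start of each trimmed segment is an assumption, and the sample text is an abbreviated stand-in for qstat -Q -f output, not a captured log.

```python
import re


def parse_queues(stdout, expression=r"Queue: ([\w\d]+)", delim="\n"):
    """Collect the first capture group of every segment that matches."""
    pattern = re.compile(expression)
    queues = []
    for segment in stdout.split(delim):
        match = pattern.match(segment.strip())  # assumption: match at segment start
        if match:
            queues.append(match.group(1))       # corresponds to valuegroup="1"
    return queues


# Illustrative output only; real output contains many more attributes per queue.
SAMPLE_STDOUT = (
    "Queue: debug\n"
    "    queue_type = Execution\n"
    "    total_jobs = 0\n"
    "\n"
    "Queue: batch\n"
    "    queue_type = Execution\n"
)

print(parse_queues(SAMPLE_STDOUT))
```

The resulting list plays the role of the values of the queues attribute, which the combo widget in the launch-tab section then offers through its items-from element.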

Figure 3: Resources tab of the Torque launch configuration showing the basic settings that are configurable by the user.

Listing 2 gives a good indication of the variety of widgets and layout options that are available for batch system implementers. Although not shown here, there are also elements available for importing, editing and using external job scripts, for managing files automatically on the target system, and for job submission and control commands (e.g. terminate, hold, release, etc.). In addition, there are commands available for specifying an interactive launch via a batch system, and for launching a debug job.

5.2 Model Driven Configuration

A content model is created directly from the XML definition files and is used by the control framework to dynamically generate the launch configuration user interface. This user interface comprises a number of tabs, each of which allows the user to supply a different kind of information required for the launch. For the control framework, one of these tabs, the Resources tab, is rendered directly from the widget elements specified in the launch-tab section of the XML definition file. Attributes and parameters defined in the configuration are then used to communicate user choices to the job submission command. This allows the tab to be completely customized to suit a particular batch system or runtime system. Figure 3 shows the basic settings tab for the Torque job scheduler definition.

The content model is also used by the control operations of the framework in order to interact with the target system. Activities such as job submission, querying job status, and handling standard output redirection are all driven by data obtained from the model elements. The framework also uses the presence or absence of model elements to determine whether particular actions are available. For example, the absence of a submit-interactive element would indicate that the resource manager supports only batch submission.

6. FUTURE WORK

We have provided a range of techniques for improving scalability by reducing the volume of data transferred between the client and server. However, there is still additional work we plan to do in this area. In particular, we plan to implement a mechanism that sends only the differences between two successive LML instances. Because few changes to system state typically occur in the short intervals used for monitoring data collection, transmitting only the differences can be quite efficient. For this approach to work, however, the server has to manage system states for every connection in order to be able to compute the difference between two successive LML instances. In addition, the LML schema has to be extended to provide support for handling differences, so that incomplete data can still be well-formed and valid against the LML schema.
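
The idea of sending only the differences between successive states, with the server remembering what each connection last received, can be sketched as follows. This is a hedged illustration and not a proposed LML extension: system state is modeled as a plain dictionary keyed by job identifier, and the names compute_diff, apply_diff and DiffServer, as well as the sample job data, are hypothetical.

```python
import copy


def compute_diff(previous, current):
    """Return the changes needed to turn `previous` into `current`."""
    removed = sorted(key for key in previous if key not in current)
    changed = {
        key: value
        for key, value in current.items()
        if key not in previous or previous[key] != value
    }
    return {"removed": removed, "changed": changed}


def apply_diff(state, diff):
    """Client side: rebuild the full state from the old state and a diff."""
    updated = {k: v for k, v in state.items() if k not in diff["removed"]}
    updated.update(copy.deepcopy(diff["changed"]))
    return updated


class DiffServer:
    """Remembers the last state sent on each connection (hypothetical)."""

    def __init__(self):
        self._last_sent = {}

    def update_for(self, connection_id, current):
        previous = self._last_sent.get(connection_id, {})
        diff = compute_diff(previous, current)
        # Store a private copy so later changes by the caller are not missed.
        self._last_sent[connection_id] = copy.deepcopy(current)
        return diff


# Three successive snapshots of an imaginary system (illustrative data).
snapshot1 = {
    "job1": {"state": "RUNNING", "nodes": 4},
    "job2": {"state": "QUEUED", "nodes": 2},
}
snapshot2 = {
    "job1": {"state": "RUNNING", "nodes": 4},
    "job2": {"state": "RUNNING", "nodes": 2},
    "job3": {"state": "QUEUED", "nodes": 8},
}
snapshot3 = {
    "job2": {"state": "RUNNING", "nodes": 2},
    "job3": {"state": "QUEUED", "nodes": 8},
}

server = DiffServer()
client_state = {}
diffs = []
for snapshot in (snapshot1, snapshot2, snapshot3):
    diff = server.update_for("client-a", snapshot)
    diffs.append(diff)
    client_state = apply_diff(client_state, diff)

print(diffs[1])
print(client_state == snapshot3)
```

The first update for a connection carries the whole state, and later updates carry only new, changed or removed entries, which is why the server must keep per-connection state as described above.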

Our initial implementation provides generic support for job submission and monitoring using a number of batch systems, including PBS, Torque, ALPS, LoadLeveler, SLURM, and GridEngine. We also support interactive submission using IBM's Parallel Environment, Open MPI, MPICH2, MVAPICH, as well as a simple remote launching capability. We are planning to add support for specific systems, such as those participating in the Extreme Science and Engineering Discovery Environment (XSEDE), as well as a range of other machines. Finally, we will continue to make improvements and enhancements to other parts of PTP not discussed in this paper, including new refactoring tools for Fortran, improvements to the support for remote synchronized projects, and enhancements to the parallel debugger. Many of these new features and improvements will be available in the 6.0 release of PTP on June 27.

7. CONCLUSION

If PTP is to continue to provide best practice tools for the development of parallel application codes, it must be scalable and extensible enough to meet the demands of the next generation of petascale systems and beyond. There are a number of different areas where scalability in such a development environment becomes important, such as the ability to manage an extremely large code base, the performance of integrated tools that operate on source and object code, the ability of user interfaces to present large amounts of data in a meaningful way, and the capacity to provide an abstraction of the very large systems that are being targeted by the developer, amongst others. In this paper we have only examined the latter two. In particular, we have described how modifications have been made to PTP in order to scalably monitor a target system of arbitrary size and present this information to the user in a useful manner. We have built this functionality on an existing framework that we know is highly scalable, and that has been used in production systems for some years.
In addition, we have discussed how a completely generic configuration system has been designed and implemented that significantly improves the ability of PTP to be extended to support additional batch and runtime systems. This configuration system provides a completely customizable way of interacting with the plethora of target environments and systems that are currently available. Although model-driven architectures are not new, this is the first time that such an approach has been used in a development environment. We believe that these new enhancements will greatly encourage third party developers to expand the support base of the platform. By doing so, we hope to expand the community of developers and users who see PTP as one of the key technologies for dealing with the complexities of petascale computing environments.

Acknowledgements

The authors would like to acknowledge the efforts of the many contributors without whom the Parallel Tools Platform would not exist. This includes the Eclipse Foundation, Los Alamos National Laboratory, Monash University, IBM Corporation, University of Oregon, Oak Ridge National Laboratory, National Center for Supercomputing Applications and others, along with the many individuals who have shared their ideas and suggestions. Thanks also to Simon Wail for his work demonstrating PTP on BG/Q, and for providing the screenshots in Figure 2. This material is partly based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under its Agreement No. HR , the United States Department of Energy under Contract No. DE-FG02-06ER25752 and Program DE-PS02-08ER08-19, and by the National Science Foundation under award number OCI.