Performance and Energy Monitoring Tools for Modern Processor Architectures


Luís Taniça
INESC-ID / IST - TU Lisbon
Rua Alves Redol 9, Lisbon, Portugal
luis.tanica@ist.utl.pt

Abstract - Accurate on-the-fly characterization of application behavior requires assessing a set of execution-related parameters at run-time, including performance, power and energy consumption. These parameters can be obtained by relying on the hardware measurement facilities built into modern multi-core architectures, such as performance and energy counters. However, current operating systems do not provide the means to directly obtain these characterization data. Thus, the user needs to rely on complex custom-built libraries with limited capabilities, which might introduce significant execution and measurement overheads. In this work, we propose two different tools for efficient performance, power and energy monitoring of systems with modern multi-core CPUs, which allow capturing the run-time behavior of a wide range of applications at different system levels: i) at the user-space level, and ii) at the kernel level, by using the OS scheduler to directly capture this information. Although the importance of the proposed monitoring facilities is evident for many purposes, we focus herein on their employment for application characterization with the Cache-aware Roofline Model. The experimental results show the capabilities of the proposed tools to deliver detailed and accurate information about the behavior of real-world applications on the underlying architectural resources. Moreover, they allow reconstructing and identifying the execution patterns of the profiled benchmarks from standard suites (SPEC CPU2006), while introducing negligible overheads.

I. INTRODUCTION

Modern computing systems are complex heterogeneous platforms capable of sustaining high computing power.
While in the past designers aimed at improving processing performance by applying power-hungry techniques, e.g., by increasing the pipeline depth and, therefore, the overall working frequency, such techniques became unbearable due to the power wall. To overcome this issue, processor manufacturers turned to multi-core designs to increase performance, typically by replicating a number of identical cores on a single die. Each core includes a set of private coherent caches and dedicated execution engines, and usually provides hardware support for multiple threads. In addition, all cores share access to a common higher-level memory organization, typically containing the last-level cache and the main memory. Although these solutions allow increasing the overall processor performance, they also introduce additional complexity into the design, making it harder for application designers to fully exploit the available processing power. For example, the contention caused by multiple cores competing for the shared resources can drastically affect the execution efficiency. In addition to these issues, current trends in computer architecture show that future processors and applications have to consider novel techniques to improve power and energy efficiency. In order to characterize and understand the behavior of such complex computational systems, accurate real-time monitoring tools are required. These allow, for example, identifying application and architectural efficiency bottlenecks in real-case scenarios, thus giving both the programmer and the computer architect hints on potential optimization targets. While many profiling tools have been developed in recent years, e.g., PAPI [] and OProfile [], it is not always easy to convert the acquired data into insightful information.
This is particularly true for modern processors, which comprise very complex architectures, including deep memory hierarchy organizations, and for which several architectural events must be analyzed. Taking into account the complexity of modern processor architectures and the effects of having different applications running concurrently on multiple cores, the Cache-aware Roofline Model (CARM) [] was proposed, which unveils architectural details that are fundamental in present-day application and architectural optimization. The CARM [] is a single-plot model that shows the practical limitations and capabilities of modern multi-core general-purpose architectures. It expresses the attainable performance of a computer architecture as an upper bound, by relating the peak floating-point performance (Flops/s), the operational intensity (Flops/byte), and the peak memory bandwidth of each cache level in the memory hierarchy (Bytes/s). The model considers the data traffic across both on-chip and off-chip memory domains, as it is perceived by the core. In this work, two different monitoring methods are proposed that combine the advantages of the recently proposed CARM with real-time, accurate monitoring facilities, in a way that allows application developers to easily relate the application behavior with the architecture characteristics, thus fostering new application optimizations. The two monitoring tools proposed herein rely on the performance and power/energy hardware interfaces and are able to extract, in real-time, important power and performance characteristics of the running application, namely: SPYMON - the user-space tool that aims at a system-wide performance analysis. The main principle behind this tool relies on spawning a process for each processor core, which handles the profiling operations for that

system's component. This tool is intended to integrate both performance and power consumption monitoring, and it is provided to the end-user via a simple-to-use interface; SCHEDMON - the tool that is mainly implemented in the kernel-space. It thus makes use of the OS internal scheduling events in order to detect context switching and to provide more accurate monitoring results. Similarly to SPYMON, this tool also provides all its functionality to the end-user through an intuitive and easy-to-use command-line interface. In addition, a run-time evaluation can be performed by means of a provided user-space library, which exports the kernel-space core functionality into user-space programs via a set of simple calls. Although there are several profiling tools available that allow obtaining performance or power consumption information, only a few provide both functionalities in a single interface. In addition, even if a full performance configuration is provided, the choice of the proper performance events to monitor is not always trivial, nor is the proper way of evaluating them in order to obtain a complete overview of the attainable application performance on the underlying architecture resources. Finally, the full performance and power consumption evaluation must be exposed to the end-user as an easy-to-use interface. However, some of the most powerful state-of-the-art performance interfaces are too complex [6] or not fully documented, which hampers their usage. The results reported in this paper illustrate the differences between the two proposed monitoring tools, and show the importance of using such monitoring techniques for understanding and characterizing application execution on modern general-purpose processors.
Overall, both monitoring methods are able to provide insightful information about the behavior of the applications and how their execution is affected by the processor architectural limitations. The remainder of this paper is organized as follows. Section II reviews the main infrastructures used by the proposed tools, and provides some insights on the state-of-the-art tools. Section III details the first proposed performance and power monitoring approach: SPYMON. Section IV details the other proposed tool: SCHEDMON. Section V presents experimental results, including comparisons between both tools and the analysis of a set of SPEC CPU2006 [9] benchmarks. Finally, Section VI presents a few concluding remarks.

II. BACKGROUND

Most modern processors contain Performance Monitoring Units (PMUs) that can be configured to count microarchitectural events such as clock cycles, retired instructions, branch mispredictions and cache misses. To count these events, a small set of Model-Specific Registers (MSRs) is provided by each architecture, which limits the total number of events that can be simultaneously measured. For example, on the Intel Sandy Bridge and Ivy Bridge architectures, the PMU facility provides two main MSR types []: Performance Monitoring Select Registers (PMSRs), which are used for configuring the events to monitor (count) in each corresponding counter; and Performance Monitoring Counters (PMCs), which hold the actual counter values. These recent architectures extend the PMCs by adding a new type of counter, the Performance Fixed Counter (PFC), which provides additional performance information without the need for any configuration. Thus, these counters can only be enabled and disabled, and monitor events predefined by the architecture. On Intel architectures, the PMU does not provide power or energy consumption measurements. In order to assess this information, the Running Average Power Limit (RAPL) interface must be used.
This allows simultaneously obtaining real-time energy consumption readings for several different domains, such as the package, cores, uncore or Dynamic Random-Access Memory (DRAM).

A. Linux Kernel Modules

Although PMU readings may be performed from user-space, configuring the PMCs must be done from a privileged level. RAPL energy status MSRs cannot be written at all, and reading them also requires special permissions. In Linux systems, there are only two different permission levels: i) the user-space, which comprises hardware privilege levels 1, 2 and 3; and ii) the kernel-space, which operates at privilege level 0. Therefore, in order to obtain the required permissions for handling the performance and/or power monitoring infrastructures, software interfaces must contain some component that runs on the kernel-space side. This can be achieved in two different ways: i) by modifying the OS kernel source, which requires recompiling the OS kernel; or ii) by using a Linux kernel module, which is a piece of code that can be integrated into the Linux kernel at run-time. The vast majority of Linux kernel modules are designated as device drivers, whether or not they are attached to a physical device []. The tools proposed herein make use of kernel modules which, although not connected to any peripheral device, may be logically seen as a way to access the physical hardware resources for performance and power/energy monitoring; it is therefore reasonable to call them drivers.

B. State-of-the-art Monitoring Tools

There are many options in the literature that provide access to the hardware performance counters. In the case of Linux, one of the earliest was the perfctr patch [] for x86 processors. Perfctr provided a low-latency memory-mapped interface to virtualized 64-bit counters on a per-process or per-thread basis. Later on, the perfmon [] interface was submitted to the kernel.
When it became apparent that perfctr would not be accepted into the Linux kernel, perfmon was rewritten and generalized as perfmon2 [] to support a wide range of processors under Linux. After a continuing effort over several years by the performance community to get perfmon2 accepted into the Linux kernel, it too was rejected and supplanted

by yet another abstraction of the hardware counters, first called perf counters in kernel 2.6.31 and then perf events [6] in kernel 2.6.32. Perf events is included in the mainline Linux kernel, which makes it the preferable choice over the other available interfaces. The interface is built around file descriptors, allocated using the newly introduced sys_perf_event_open() system call. This system call returns a file descriptor representing a virtual performance counter. Events are specified at open time by using an elaborate perf_event_attr structure, which contains dozens of fields that can interact in complex ways. PMCs are enabled or disabled via ioctl() calls and their value can be read using a call to read(). Sampling can be enabled to periodically read the counters and write the values to a circular buffer, which must be allocated using an mmap() call. Signals are sent to the process holding the referred file descriptors when new data is available. Although perf events has proven to be a quite powerful interface, it might be too complex for the common user. Moreover, it does not provide access to the RAPL interface: if one requires monitoring power alongside performance, a different interface has to be used. PAPI [] is one of the available tools that use perf events. Its objective is to be highly portable, by reusing the available Operating System (OS) performance interfaces, while allowing the inclusion of plug-ins to read other counters, such as those provided by NVIDIA Graphics Processing Units (GPUs). PAPI provides two interfaces to the underlying counter hardware: a simple high-level interface and a fully-programmable low-level interface. The high-level interface only provides functions for starting, stopping and reading the counters. The low-level interface provides much more manageability and control over the available resources. Event multiplexing, multi-thread support, user callbacks on threshold and statistical profiling are some of the available functionalities.
Recent versions of PAPI also include the possibility to measure power/energy consumption [7]. On the other hand, if deep control over the available performance resources is needed, PAPI might not be the best option, since it does not provide direct access to the performance unit but virtualizes it instead. If one is interested in quick binary profiling, without having to write code for it, Perf [] might be a preferable choice. This is a profiling Linux command-line tool and one of the most referenced. It can be seen as an abstraction over the perf events interface, much more accessible to the common user. Perf provides a set of commands which allow not only profiling but also reporting the profiling results in a user-friendly way. It provides support for multi-threaded applications, event multiplexing and statistical profiling, among others. A processor-wide mode is also available, allowing the user to profile not a single application but the system itself. However, this tool lacks the possibility of power profiling, which forces the user to search for other tools when energy information is a requirement. Yet another well-known resource is OProfile [], which is composed of a Linux kernel driver, a daemon and a perf-like command-line tool. OProfile's kernel driver is meant for abstracting the performance hardware registers and dumping the sampling information at regular intervals. The daemon can be started and stopped by the user and is responsible for consuming the profiling information provided by the kernel driver and saving it in OProfile's sampling database. This database can later be accessed by the user to extract useful profiling information by using the available command-line tools, like opreport. Although this tool appears to be complete in terms of performance, it still lacks the functionality for providing energy status information. There are several other profiling tools available, like the Intel VTune Performance Analyzer [7], LIKWID [] or LIMIT [6].
The choice of the right tool is not always trivial and depends mostly on the user's needs. For instance, one may require higher abstraction, lower overhead, finer control or more detailed information.

III. USER-SPACE MONITORING TOOL: SPYMON

SpyMon's main goal is to provide a portable tool with an intuitive interface for the end-user, without relying on the underlying OS's monitoring facilities. Hence, most of SpyMon's implementation lies in the user-space, in order not to interfere with or depend on the running system. SpyMon targets a core-oriented approach, by monitoring the behavior of each Logical Processor Core (LPC), and is therefore able to capture the information of all running applications. As a result, SpyMon allows monitoring the whole system, regardless of what is running at a given time instant on each LPC. This means that, even if an application migrates to another core, launches new threads, or its execution is constrained by the contention caused by other running applications, SpyMon is able to capture all these execution events.

A. Architecture

The proposed tool is composed of three main parts that interact in a hierarchical way, namely: i) the monitor, which controls the tool's execution flow and provides all the functionality to the user; ii) a set of spies, which are responsible for the communication with the PMU interface and for handling the performance profiling information; and iii) a Linux kernel module, which provides access to the hardware facilities, thus overcoming any privilege access restrictions. SpyMon's module, or driver, is meant for overcoming the privilege restrictions that are usually associated with the hardware monitoring interfaces [8], and is composed of: i) a small number of structures for communicating with the tool; ii) the addresses of the underlying performance and power MSRs; and iii) a set of functions that operate over these data structures and allow reading from and writing to the underlying hardware.
At the time of the module's installation, a new device file is created in the /dev directory, allowing communication between the user-space processes and the driver. The module is accessed by both the monitor and the spies via the ioctl() system call over the device file. By using this call, the tool is not only able to send

a specific command to the module, but also to specify an argument, which is used to send the proper data structures, either for holding the sample readings or for configuration purposes.

[Fig. 1. Spatial perception of SpyMon while monitoring application threads: one spy is pinned to each LPC and the monitor to the last LPC.]

The monitor process not only controls the tool's execution flow, translating its functionality to the user as a simple command-line interface, but is also responsible for monitoring power/energy consumption, when required. On the other hand, each spy is launched with the purpose of handling the performance evaluation on a single LPC. The typical SpyMon configuration is to launch a spy to monitor the performance of each available LPC and to pin the monitor to the last one, as shown in Figure 1. In the illustrated example, the monitor forks 8 new processes (spies) and pins each of them to a different LPC. By default, the monitor process is pinned to the last LPC, but different configurations are possible. The spies are responsible for handling the communication with the PMU, in order to output the obtained PMU samples. Since this work relies on facilities for monitoring energy consumption at the level of the whole chip (RAPL), the monitor is responsible for the communication with these facilities and, therefore, for reading the energy status information.

B. Execution and Implementation

When started, the tool firstly parses the input parameters (step 1). If the --help sub-command is provided, the usage information is printed to the standard output (step 8). If the --list sub-command is provided, then the complete list of available hardware events is shown (step 9).
On the other hand, if either the --start or the --roof argument is provided, then the monitoring parameters are configured according to the user's input specifications, and the application profiling is initiated. In brief, --start activates the most commonly used SpyMon profiling mode, while --roof enables run-time cache-aware roofline application monitoring. When the --start command is provided, the tool firstly parses and verifies the input parameters. Then, the monitor process is pinned to a specific LPC (step 2), by using the sched_setaffinity() system call. This call informs the scheduler on which LPCs the calling thread is allowed to execute. By default, the monitor is pinned to the last available LPC, although its affinity can be changed by the end-user in the initial tool configuration.

[Fig. 2. SpyMon's execution flow: (a) monitor execution diagram; (b) spy execution diagram.]

Afterwards, the main process forks several new processes (spies), whose number corresponds to the number of required target monitoring cores (step 3). By default, all LPCs are monitored, i.e., SpyMon firstly detects the number of available LPCs and launches one spy for each LPC. The general execution diagram for a single spy process is depicted in Figure 2b and it starts by setting up the pipe communication channel with the monitor (step a). Following the process spatial configuration, the PMU configurations are made (step 4).
After parsing the provided Performance Monitoring Event (PME) configuration, SpyMon creates a number of event-sets by grouping the events according to the number of available PMCs. For instance, if the architecture only supports 4 PMCs and the user provides 7 PMEs, then the tool will define 2 event-sets, where the first event-set contains the first 4 provided PMEs and the second one contains the remaining 3. When the PMU configuration is done, the monitor sends the configuration structures to the spies, by means of the previously established pipe communication channels. In the spy execution diagram (see Figure 2b), this corresponds to step b. From this point on, each spy starts monitoring its target LPC and producing the performance sampling information accordingly. If more than one event-set is defined, then event multiplexing is applied. In these cases, a single performance sample corresponds to a specific predefined time interval in which the performance counter readings from different event-sets are merged together. This mechanism is performed by the spies, since they are responsible for handling the PMU. When performing a system-wide evaluation, it is usually required to launch specific applications and analyze their performance, as specified by the end-user. SpyMon provides this functionality via a set of simple configuration commands,

that instruct not only launching the target applications, but also pinning their execution to the required LPCs (step 5). This is achieved by using the fork() and execve() system calls. When all the initializations and configurations are performed, SpyMon initiates profiling. At this point, the monitor process also starts reading and producing RAPL sampling information (steps 6-7), until the monitored application terminates. At the same time, each spy reads and produces PMU information samples at regular intervals (steps d-g). As soon as the performance counter readings are retrieved for the current event-set (step e), each spy activates event multiplexing if the number of event-sets is greater than one (evsets > 1), or immediately outputs the counter readings otherwise (step g). When event multiplexing is activated, the next event-set is configured (step f), i.e., a different set of events will start being counted during the next multiplexing time interval (step d). When the counts of the last event-set are retrieved, the sample is considered to be complete and its contents are output (step g). The described process (steps d-g) is repeated once per sampling time interval, until monitoring is completed. The information produced by both the monitor and the spies is directly printed out to files. The number of files corresponds to the number of processes executed inside the tool, i.e., to the number of spies plus the monitor, where each file corresponds to exactly one of those processes.

C. Usage

As already mentioned, SpyMon provides four main sub-commands: --help, --list, --start and --roof. The set of supported options can be retrieved with the --help parameter, which also provides a short summary on how to use SpyMon's interface with different options. The --list option outputs a list of the predefined hardware events provided by the tool.
The --start command allows defining the events to monitor (which can be predefined or raw events), the PFC configuration and the target cores for monitoring, as well as enabling power consumption information, defining the sampling time interval and providing target applications to be monitored. The --roof command only provides the ability to define the sampling time interval, enable power consumption information and provide the target application. The event-sets used for obtaining the predefined CARM analysis are shown in Table I. As it can be observed, 6 different events must be monitored in order to assess the number of Floating Point (FP) operations, plus 2 additional events to estimate the amount of data traffic. As a result, event multiplexing is required and the event-set configuration is made according to the information shown in Table I.

IV. SCHEDULER-BASED MONITORING TOOL: SCHEDMON

In contrast to SpyMon, SchedMon is mostly implemented in the kernel-space and aims at an application-based evaluation, thus allowing a deep profiling of the monitored applications. SchedMon's main design principle is modularity, which allows easily extending its functionality. In order to achieve this, the tool is designed not to depend on the available OS performance interfaces (e.g., perf events), i.e., it does not rely on any already implemented structure or functionality.

[Fig. 3. SchedMon's components and their disposition in the OS privilege layers: smon and SchedMon's library in user-space, SchedMon's kernel module in kernel-space, communicating via system calls and shared memory with the PMU and RAPL hardware facilities.]

A. Architecture

SchedMon is composed of two main parts, namely: i) a Linux kernel module, or driver, which implements the tool's core mechanisms; and ii) a user-space tool (smon), which exposes the whole functionality of the underlying module and translates it into a simple and intuitive user interface.
The communication between both components is made by means of a specifically developed user-space library, which provides a set of functions for handling the tool's main functionalities. Figure 3 illustrates the interaction between SchedMon's components, as well as their disposition in the OS privilege layers. As it can be observed, the Linux kernel module is responsible for interacting with the hardware, thus providing the necessary performance and power/energy consumption information via the PMU and RAPL facilities. The communication between the module and the user-space tool is made through a set of system calls over the driver's device file. The necessary communication commands are provided by the tool's user-space library. In addition to the command-related communication, a shared memory area is used for exchanging the produced profiling information at run-time.

B. SCHEDMON's Linux Kernel Module

SchedMon's kernel module, or driver, is the main component of the tool, since it provides the main functionality and holds all data structure implementations. The driver, when loaded into the kernel, creates a file in the /dev directory (the device) which acts as a communication medium to the driver, i.e., the operations over this file trigger the corresponding module function to handle each specific operation. The main operations over SchedMon's device file include: ioctl - allows configuring new PMEs, registering new tasks for monitoring and consuming profiling information from the shared memory from the user-space; mmap - allows initializing the shared memory between the kernel and the user-space, which is used for exchanging the profiling information samples;

TABLE I
SETS OF PMES USED FOR PERFORMANCE PROFILING WHEN USING THE CACHE-AWARE ROOFLINE MODEL.

FP_SSE_PACKED_SINGLE - Number of SSE single-precision FP packed µops executed.
FP_SSE_PACKED_DOUBLE - Number of SSE double-precision FP packed µops executed.
FP_AVX_PACKED_SINGLE - Number of AVX 256-bit packed single-precision FP instructions executed.
FP_AVX_PACKED_DOUBLE - Number of AVX 256-bit packed double-precision FP instructions executed.
FP_SSE_SCALAR_SINGLE - Number of SSE single-precision FP scalar µops executed.
FP_SSE_SCALAR_DOUBLE - Number of SSE double-precision FP scalar µops executed.
MEM_UOP_RETIRED_ALL_LOADS - Qualify any retired memory µops that are loads.
MEM_UOP_RETIRED_ALL_STORES - Qualify any retired memory µops that are stores.

SSE - Streaming SIMD Extensions; FP - floating-point; AVX - Advanced Vector Extensions; µops - micro-operations.

poll - implements the synchronization mechanisms used by SchedMon in order to coordinate the read and write operations over the previously allocated shared memory. By relying on the above described calls, full control and configuration of SchedMon's driver from the user-space can be attained. The following text describes the main structures and mechanisms implemented by SchedMon in order to achieve its full functionality.

1) Events, Event-sets and Environment: SchedMon keeps all the profiling configurations inside the driver. The three main structures used for keeping track of the registered performance configurations are: events - the structure that holds a specific PME configuration; event-sets - structures used to aggregate a number of events into several groups that can be configured into the PMU; environment - a structure that is temporarily created at the time of the monitoring execution and contains all the profiling configurations for that specific run, e.g., the required profiling information, sampling time interval and event-sets to monitor.
This structure hierarchy allows not only reusing the same event configurations across different event-sets, but also reusing the same event-sets across distinct runs.

2) Monitored Tasks: SchedMon defines two types of tasks: leaders and children. In order to monitor an application using the tool, the target process, or thread, must be registered into the driver. For this, an ioctl() system call with the proper request must be performed. The task registration request requires two distinct arguments: the target PID, which is the task identification parameter, and an environment data structure containing the profiling configuration. Under SchedMon's driver, every registered task is appointed as a leader, while a child corresponds to a task descending from a leader. This only applies if the inherit option is enabled upon the leader task registration; otherwise, the driver will not register any children descending from that task. Each leader task registered in the driver is associated with a performance environment, i.e., a data structure containing the profiling execution configuration. Whenever a child is allocated by the driver, it inherits its leader's performance environment and, therefore, the same configuration.

3) Scheduling Infrastructures: SchedMon's driver makes use of the OS scheduler tracepoints in order to attain full control over the monitored tasks' execution, and it is driven by the following OS scheduling events: sched_process_exec() - this event is triggered whenever a task performs an execve() system call, and it is used by SchedMon to start monitoring a target task from the exact beginning of its execution; sched_process_fork() - this event is triggered whenever a task forks another task, and it is used by the tool's driver to enable monitoring of multi-threaded applications; sched_switch() - this event is triggered on a specific LPC whenever the Linux scheduler replaces the currently executing task with another one.
SchedMon uses this event to detect when a monitored task is scheduled in or scheduled out of a specific LPC, in order to perform the necessary PMU context switch; sched_migrate_task() - this event is triggered whenever a task migrates from one LPC to another, and it is used by SchedMon to facilitate the search for a monitored task; sched_process_exit() - this event is triggered whenever a task terminates its execution, and it is used to detect the termination of monitored tasks. 4) Sample Types: SchedMon's driver currently provides five different types of samples, which can be enabled through the environment structure at the time of the registration of a target task: PMU - these samples contain the performance information and are always enabled by default; RAPL - these samples contain the energy status information obtained from the RAPL interface; migration - a migration sample is provided each time a monitored task migrates to a different LPC; fork - a fork sample is provided whenever a monitored multi-threaded application forks/creates a new task; scheduling - scheduling samples contain the information on when the task was scheduled in and scheduled out of a specific LPC by the Linux scheduler. 5) Sampling: Sampling refers to the process of extracting specific information from the execution at regular time intervals. In order to provide accurate performance sampling, several auxiliary data structures are used for this process.

The main data structures used for performance sampling are: i) the array containing the different event-set configurations; ii) a Linux high-resolution timer, for synchronization purposes and sampling at the nanosecond granularity; and iii) a temporary PMU sample, which holds the current sample counts. Thus, each time a monitored task is executed on a specific LPC, the corresponding event-set is configured into the PMU and the timer is set to trigger after the current sample's remaining time. When the timer fires, the PMU information is obtained and a performance sample is produced. When RAPL sampling is enabled, a distinct high-resolution timer is set to trigger at regular time intervals. Each time this timer fires, the energy status information is obtained and provided to the user-space as a RAPL sample. 6) Kernel-User Communication: As previously mentioned, the profiling information is exchanged between the kernel and user-space by means of shared memory. The mechanism used by SchedMon to exchange profiling information with the user-space comprises: i) a memory ring-buffer; ii) a virtual memory area, shared between the kernel and the user-space; and iii) a synchronization procedure. The virtual memory area is allocated by the user-space process (smon) by means of the mmap() system call, before starting the profiling. When SchedMon's driver detects an mmap() call, it creates a shared memory ring-buffer, which operates as the communication medium between the driver and smon. The synchronization is performed by means of the poll() system call. Hence, whenever the required amount of samples is available for consumption, the driver signals the user-space process. The user-space process can then access the corresponding sampling information and, after that, it alerts the driver by using the ioctl() system call. C.
User-space Tool The SchedMon user-space component, smon, is integrated in the tool in order to facilitate the access to and handling of the underlying driver. By relying on the driver's user-space library for configuration and on the mmap() and poll() system calls, smon translates the whole tool's functionality into an easy-to-use command-line interface. The main functionalities of smon include: i) the creation of events; ii) the definition of event-sets, by using the already created events; and iii) the ability to profile an application. Function call tracing is the process of detecting whenever a target application (the tracee) enters or leaves a function call. This is an important feature for detecting potential execution bottlenecks in the most time-consuming parts of the application. The method used by smon to detect the entry and return points of a function requires preprocessing the dumped assembly code of the application. The detected execution points are then assigned to breakpoint structures, which hold the original bytes contained in those positions and are used to inject code at the same memory addresses. SchedMon is able to trace the target task's function calls by resorting to the ptrace() system call and by injecting a trap instruction at each breakpoint. As a result, the tool is able to track whenever a breakpoint is hit. Smon is also able to detect whenever a new process is forked or switches its execution image, thus allowing call tracing of multi-threaded applications, even when different tasks execute distinct binaries. D. Usage SchedMon provides an interface similar to SpyMon's, which translates the full functionality of the tool to the end-user in a simple and intuitive way. The interface is composed of four main commands: smon-event - this command allows adding new PME configurations to the tool; smon-evset - this command allows adding new event-sets to the tool.
Each event-set must be composed of a number of already defined events; profile - this command allows profiling a given target application. In addition, it provides several options for configuring distinct profiling parameters (e.g., sampling time interval, the required sample types, the shared-memory size and the event-sets to be monitored); smon-roof-run - this command allows performing a predefined performance evaluation of the target application according to the CARM. In order to achieve this, SchedMon relies on the same predefined event-sets as SpyMon (see Table I). V. EVALUATION RESULTS A. Experimental Environment The herein presented results were obtained on a machine containing an Intel i7 3770K processor, which is based on the Ivy Bridge micro-architecture, with 4 physical cores and hyper-threading support, i.e., 8 LPCs. It operates at 3.5GHz, although it can attain 3.9GHz in turbo boost mode, and its memory organization comprises three cache levels of 32kB, 256kB and 8192kB, respectively. The cache levels L1 and L2 are shared between the LPCs contained in the same PPC, and the last-level cache, L3, is shared between all the cores. The DRAM memory controllers support up to two channels of DDR3. For executing the following tests, the processor's clock was set to continuously run at a fixed frequency of 3.5GHz. Moreover, the machine provides a PMU containing fixed (PFC) and programmable (PMC) counters, as well as a RAPL interface. B. SpyMon 1) System-wide Analysis: Figure illustrates a performance evaluation of four distinct SPEC CPU2006 benchmarks (milc, namd, GemsFDTD and tonto). In order to obtain the depicted results, each benchmark test was executed individually, without the interference of any other applications (with the exception of the OS tasks). For each execution, the benchmark process was pinned to its corresponding LPC, as shown in Figure. Each of the shown LPCs was chosen so as to belong to a distinct Physical Processor Core (PPC).
After running each of the four tests individually, a final run was performed, in which all four tests were run at the same time.

[Figure: SpyMon performance evaluation of SPEC CPU2006 benchmarks, for a ms sampling time interval. Panels (a)-(h) show milc, namd, GemsFDTD and tonto, each running alone and alongside the others, on distinct cores.]

The obtained results are presented in Figures b, d, f and h. In each of the runs, the sampling time interval was set to ms. By analyzing the figure, several details can be extracted. First, all the benchmarks achieve lower performance when run alongside each other, due to shared resource contention. This conclusion can be drawn by observing that i) each benchmark's duration is longer when run alongside the others, and ii) each benchmark's performance values are significantly lower. Another interesting observation concerns the shapes of the obtained plots, where different parts of the execution can be detected. For instance, when running the milc benchmark alone (see Figure a), at least three distinct execution phases can be identified, each occurring at regular time intervals and delivering a different attainable performance (in GFlops/s). However, when run together, the shapes of each benchmark execution appear to change according to the concurrent applications. For example, the shape of the GemsFDTD benchmark is completely distorted when run with the other applications (see Figures e and f). Figure depicts the experimentally obtained power consumption for the above described test conditions. The plotted information corresponds to the package domain, i.e., it represents the power consumption of the whole chip.

[Figure: Power consumption of the four benchmarks (milc, namd, GemsFDTD, tonto) run separately (a)-(d) and simultaneously (e).]

When each benchmark is executed alone (see Figures a, b, c and d), the chip power consumption is around W. As can be observed, the power consumption depends not only on whether a core is activated, but also on the resource utilization. For instance, as shown in Figure a, the power consumption assumes a shape similar to the one observed in the milc performance profile. On the other hand, Figure e shows the power consumption when all benchmarks were simultaneously executed. As can be observed, each additional activated LPC corresponds to an increment of approximately W in the system's power consumption. 2) CARM and Power Evaluation: Figure 6 illustrates the CARM and power consumption evaluation of the tonto benchmark.

[Figure 6: SpyMon evaluation of the SPEC CPU2006 benchmark tonto, for a ms sampling time interval; (a) CARM analysis and (b) power analysis according to the FP types (DBL, SSE, AVX), against the DBL (ADD,MUL), SSE (ADD,MUL) / DBL (MAD) and AVX (ADD,MUL) / SSE (MAD) rooflines, over operational intensity [flops/bytes].]

Figure 6a contains the CARM information for tonto. This test presents two distinct performance parts, corresponding to the predominant scalar and Streaming SIMD Extensions (SSE) FP types, respectively. As can be observed in Figure 6b, the two distinct parts of the execution interchangeably switch over time. During the parts of the execution corresponding to the scalar instructions (DBL), one can conclude that tonto is mainly memory-bound, since it lies to the left of the corresponding ridge point, both for the ADD/MUL and MAD rooflines. In fact, Figure shows that these zones of the execution are memory dependent and inflict changes in the performance shapes of applications running alongside. On the other hand, when executing SSE instructions, it is considered to be more compute-bound. C. SchedMon 1) Multi-threaded Scheduling Information: SchedMon not only allows detecting and monitoring multi-threaded applications, but also provides the means to analyze the scheduling route of each task's execution. This allows obtaining more detailed information on the system's scheduling mechanisms, as well as extracting useful insights about the application's structure.

[Figure 7: Scheduling information for the OpenCL application fdtd.] [Figure 8: Milc performance in time, colored according to its function call tracing profile (grsource_imp(), imp_gauge_force(), ks_congrad(), eo_fermion_force()).]

Figure 7 shows the scheduling information corresponding to an FDTD OpenCL [] application execution. As can be seen, SchedMon is capable of monitoring all the information regarding when each of the application's tasks enters or leaves a Central Processing Unit (CPU) (LPC). Since the underlying hardware contains 8 LPCs and the tested application is composed of 9 tasks, it is not possible to run all the tasks at the same time on all LPCs.
In this specific test, the OS scheduler solves this issue by constantly migrating task 79 from one core to another. For example, at around ms of the execution time, task 79 is migrated from LPC to LPC 6. Another interesting phenomenon can be observed at around 9 seconds of the execution, where all tasks stop executing for about one second, with the sole exception of the leader task. This indicates that all the tasks are waiting either for resources or for instructions from the main thread (786), thus showing the capability of SchedMon to provide insights on the application structure. 2) Function Call Tracing: Figure 8 depicts the performance analysis of milc in time, and colors the samples according to the high-level function that is currently being executed.

[Figure 9: SchedMon evaluation of the SPEC CPU2006 benchmark tonto, for a ms sampling time interval; (a) CARM analysis and (b) power analysis according to the call tracing profile (add_constraint(), make_constraint_data(), make_fock_matrix()).]

As previously observed, milc presents several distinct phases, and hence it was a preferred benchmark for this particular demonstration. As can be observed in Figure 8, it is possible to extract a pattern from the milc execution, where each distinct performance phase corresponds to a different high-level function. This allows not only evaluating each execution part of a given application, but also detecting possible performance bottlenecks. 3) CARM and Power Evaluation: In order to obtain a different perspective from the already obtained CARM and power consumption profiles for tonto, the information samples are now colored according to their function call tracing.
In Figure 9a, it can be observed that tonto's high-level functions correspond to distinct CARM phases. The functions add_constraint() and make_constraint_data() present a similar behavior and attain a higher performance, whilst make_fock_matrix() presents a lower performance and is contained in the memory-bound area. In contrast to Figure 6a, additional information can be obtained by analyzing the execution call tracing profile, which allows detecting and further optimizing the possible bottlenecks of the application execution. Thus, tonto contains observable phases both in time and in the CARM. In comparison to SpyMon, it can be observed that the power consumption is reduced when using SchedMon as the monitoring tool. This can be explained by the fact that SchedMon does not create additional tasks for monitoring, i.e., it makes use of the tasks already running in the system in order to periodically read the energy status information. On the other hand, SpyMon is composed of 9 processes (monitor plus spies), which are actively monitoring the different LPCs at run-time, including those that are not currently running any application.

[Figure: Overhead of taking a PMU sample (a) and a RAPL sample (b), in µs, for SpyMon and SchedMon, across sampling time intervals (ms).] [Figure: Number of instructions per sample (LD, ST, OT) when self-monitoring, for (a) SpyMon and (b) SchedMon, across sampling time intervals (ms).]

D. Overhead Discussion In order to perform a fair overhead analysis, both tools were instrumented with the rdtsc instruction and run under similar conditions, in order to obtain the median overhead of taking a PMU and a RAPL sample. Figure illustrates the obtained results for both tools: Figure a shows the overhead of taking a PMU sample, while Figure b shows the overhead of taking a RAPL sample. As can be observed, SchedMon presents a lower overhead in both cases. The overhead of producing a PMU sample is around .6µs in SchedMon, compared to around .9µs in SpyMon, while the overhead of producing a RAPL sample is around .µs for SchedMon, compared to around .µs in SpyMon. In addition, another evaluation test was performed, in which each of the tools was run without the execution of any benchmarks. Figures a and b illustrate the number of instruction counts per sample obtained by SpyMon and SchedMon, respectively. As can be observed, SchedMon imposes a lower overhead in terms of the number of instructions per sample than SpyMon. Nonetheless, it is important to notice that these results account for all the instructions executed during the tools' execution, which might include additional counts from the OS. VI.
CONCLUSION In this work, two new tools for accurate application monitoring and characterization are proposed, which extract run-time information at different OS levels, namely: SPYMON at the user-space level and SCHEDMON at the kernel-space level (OS scheduler). These tools combine the hardware measurement facilities available in modern multi-core architectures with the Cache-aware Roofline Model. This allows run-time characterization of application execution in terms of both performance and power/energy consumption, and allows extracting important guidelines for application optimization. The experimental results presented in this paper show that both SPYMON and SCHEDMON provide accurate performance characterization of real-world applications. However, core-oriented characterization with SPYMON may result in an increased power consumption when monitoring cores that are not used by any of the running applications. On the other hand, SCHEDMON provides multi-threaded application evaluation with very low overheads. Despite these differences, both monitoring methods allow the user/programmer to get a detailed picture of the behavior of the application and of how its execution is affected by the processor's architectural features.

ACKNOWLEDGMENT This work was supported by national funds through FCT - Fundação para a Ciência e a Tecnologia, under the project PHSC - Stretching the Limits of Parallel Processing on Heterogeneous Computing Systems, reference PTDC/EEI-ELC//.

REFERENCES
[1] Perf Wiki tutorial on perf. Accessed: -6-.
[2] Perfmon SourceForge project page. Accessed: -6-.
[3] Shirley Browne, Jack Dongarra, Nathan Garner, George Ho, and Philip Mucci. A portable programming interface for performance evaluation on modern processors. International Journal of High Performance Computing Applications, 2000.
[4] William E. Cohen. Tuning programs with OProfile. Wide Open Magazine.
[5] Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. Linux Device Drivers. O'Reilly Media, Inc.
[6] John Demme and Simha Sethumadhavan. Rapid identification of architectural bottlenecks via precise event counting. In ACM SIGARCH Computer Architecture News, volume 39. ACM, 2011.
[7] Jack Donnell. Java performance profiling using the VTune Performance Analyzer.
[8] Agner Fog. Software optimization resources. Accessed: --.
[9] John L. Henning. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 2006.
[10] Aleksandar Ilic, Frederico Pratas, and Leonel Sousa. Cache-aware Roofline model: Upgrading the loft. IEEE Computer Architecture Letters, 2014.
[11] Intel. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3: System Programming Guide.
[12] Sverre Jarp, Ryszard Jurga, and Andrzej Nowak. Perfmon2: A leap forward in performance monitoring. In Journal of Physics: Conference Series. IOP Publishing, 2008.
[13] Lidia Kuan, Pedro Tomas, and Leonel Sousa. A comparison of computing architectures and parallelization frameworks based on a two-dimensional FDTD. In International Conference on High Performance Computing and Simulation (HPCS). IEEE.
[14] Mikael Pettersson. Perfctr: Linux performance monitoring counters driver.
[15] Jan Treibig, Georg Hager, and Gerhard Wellein. LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. In Parallel Processing Workshops (ICPPW). IEEE, 2010.
[16] Vincent M. Weaver. Linux perf_event features and overhead. In The 2nd International Workshop on Performance Analysis of Workload Optimized Systems (FastPath), 2013.
[17] Vincent M. Weaver, Matt Johnson, Kiran Kasichayanula, James Ralph, Piotr Luszczek, Daniel Terpstra, and Shirley Moore. Measuring energy and power with PAPI. In Parallel Processing Workshops (ICPPW). IEEE, 2012.


More information

Operating Systems. 05. Threads. Paul Krzyzanowski. Rutgers University. Spring 2015

Operating Systems. 05. Threads. Paul Krzyzanowski. Rutgers University. Spring 2015 Operating Systems 05. Threads Paul Krzyzanowski Rutgers University Spring 2015 February 9, 2015 2014-2015 Paul Krzyzanowski 1 Thread of execution Single sequence of instructions Pointed to by the program

More information

Perfmon2: a flexible performance monitoring interface for Linux

Perfmon2: a flexible performance monitoring interface for Linux Perfmon2: a flexible performance monitoring interface for Linux Stéphane Eranian HP Labs eranian@hpl.hp.com Abstract Monitoring program execution is becoming more than ever key to achieving world-class

More information

White Paper. Real-time Capabilities for Linux SGI REACT Real-Time for Linux

White Paper. Real-time Capabilities for Linux SGI REACT Real-Time for Linux White Paper Real-time Capabilities for Linux SGI REACT Real-Time for Linux Abstract This white paper describes the real-time capabilities provided by SGI REACT Real-Time for Linux. software. REACT enables

More information

OpenSPARC T1 Processor

OpenSPARC T1 Processor OpenSPARC T1 Processor The OpenSPARC T1 processor is the first chip multiprocessor that fully implements the Sun Throughput Computing Initiative. Each of the eight SPARC processor cores has full hardware

More information

Security Overview of the Integrity Virtual Machines Architecture

Security Overview of the Integrity Virtual Machines Architecture Security Overview of the Integrity Virtual Machines Architecture Introduction... 2 Integrity Virtual Machines Architecture... 2 Virtual Machine Host System... 2 Virtual Machine Control... 2 Scheduling

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

D1.2 Network Load Balancing

D1.2 Network Load Balancing D1. Network Load Balancing Ronald van der Pol, Freek Dijkstra, Igor Idziejczak, and Mark Meijerink SARA Computing and Networking Services, Science Park 11, 9 XG Amsterdam, The Netherlands June ronald.vanderpol@sara.nl,freek.dijkstra@sara.nl,

More information

Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual

Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual Overview Metrics Monitor is part of Intel Media Server Studio 2015 for Linux Server. Metrics Monitor is a user space shared library

More information

Xeon+FPGA Platform for the Data Center

Xeon+FPGA Platform for the Data Center Xeon+FPGA Platform for the Data Center ISCA/CARL 2015 PK Gupta, Director of Cloud Platform Technology, DCG/CPG Overview Data Center and Workloads Xeon+FPGA Accelerator Platform Applications and Eco-system

More information

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get

More information

Chapter 3 Operating-System Structures

Chapter 3 Operating-System Structures Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual

More information

Sequential Performance Analysis with Callgrind and KCachegrind

Sequential Performance Analysis with Callgrind and KCachegrind Sequential Performance Analysis with Callgrind and KCachegrind 4 th Parallel Tools Workshop, HLRS, Stuttgart, September 7/8, 2010 Josef Weidendorfer Lehrstuhl für Rechnertechnik und Rechnerorganisation

More information

Exascale Challenges and General Purpose Processors. Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation

Exascale Challenges and General Purpose Processors. Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation Exascale Challenges and General Purpose Processors Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation Jun-93 Aug-94 Oct-95 Dec-96 Feb-98 Apr-99 Jun-00 Aug-01 Oct-02 Dec-03

More information

theguard! ApplicationManager System Windows Data Collector

theguard! ApplicationManager System Windows Data Collector theguard! ApplicationManager System Windows Data Collector Status: 10/9/2008 Introduction... 3 The Performance Features of the ApplicationManager Data Collector for Microsoft Windows Server... 3 Overview

More information

Sequential Performance Analysis with Callgrind and KCachegrind

Sequential Performance Analysis with Callgrind and KCachegrind Sequential Performance Analysis with Callgrind and KCachegrind 2 nd Parallel Tools Workshop, HLRS, Stuttgart, July 7/8, 2008 Josef Weidendorfer Lehrstuhl für Rechnertechnik und Rechnerorganisation Institut

More information

Using a Generic Plug and Play Performance Monitor for SoC Verification

Using a Generic Plug and Play Performance Monitor for SoC Verification Using a Generic Plug and Play Performance Monitor for SoC Verification Dr. Ambar Sarkar Kaushal Modi Janak Patel Bhavin Patel Ajay Tiwari Accellera Systems Initiative 1 Agenda Introduction Challenges Why

More information

Technical Paper. Moving SAS Applications from a Physical to a Virtual VMware Environment

Technical Paper. Moving SAS Applications from a Physical to a Virtual VMware Environment Technical Paper Moving SAS Applications from a Physical to a Virtual VMware Environment Release Information Content Version: April 2015. Trademarks and Patents SAS Institute Inc., SAS Campus Drive, Cary,

More information

Best Practises for LabVIEW FPGA Design Flow. uk.ni.com ireland.ni.com

Best Practises for LabVIEW FPGA Design Flow. uk.ni.com ireland.ni.com Best Practises for LabVIEW FPGA Design Flow 1 Agenda Overall Application Design Flow Host, Real-Time and FPGA LabVIEW FPGA Architecture Development FPGA Design Flow Common FPGA Architectures Testing and

More information

Multi-core and Linux* Kernel

Multi-core and Linux* Kernel Multi-core and Linux* Kernel Suresh Siddha Intel Open Source Technology Center Abstract Semiconductor technological advances in the recent years have led to the inclusion of multiple CPU execution cores

More information

Operating Systems 4 th Class

Operating Systems 4 th Class Operating Systems 4 th Class Lecture 1 Operating Systems Operating systems are essential part of any computer system. Therefore, a course in operating systems is an essential part of any computer science

More information

serious tools for serious apps

serious tools for serious apps 524028-2 Label.indd 1 serious tools for serious apps Real-Time Debugging Real-Time Linux Debugging and Analysis Tools Deterministic multi-core debugging, monitoring, tracing and scheduling Ideal for time-critical

More information

D5.6 Prototype demonstration of performance monitoring tools on a system with multiple ARM boards Version 1.0

D5.6 Prototype demonstration of performance monitoring tools on a system with multiple ARM boards Version 1.0 D5.6 Prototype demonstration of performance monitoring tools on a system with multiple ARM boards Document Information Contract Number 288777 Project Website www.montblanc-project.eu Contractual Deadline

More information

Eight Ways to Increase GPIB System Performance

Eight Ways to Increase GPIB System Performance Application Note 133 Eight Ways to Increase GPIB System Performance Amar Patel Introduction When building an automated measurement system, you can never have too much performance. Increasing performance

More information

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General

More information

A Comparative Study on Vega-HTTP & Popular Open-source Web-servers

A Comparative Study on Vega-HTTP & Popular Open-source Web-servers A Comparative Study on Vega-HTTP & Popular Open-source Web-servers Happiest People. Happiest Customers Contents Abstract... 3 Introduction... 3 Performance Comparison... 4 Architecture... 5 Diagram...

More information

A Brief Survery of Linux Performance Engineering. Philip J. Mucci University of Tennessee, Knoxville mucci@pdc.kth.se

A Brief Survery of Linux Performance Engineering. Philip J. Mucci University of Tennessee, Knoxville mucci@pdc.kth.se A Brief Survery of Linux Performance Engineering Philip J. Mucci University of Tennessee, Knoxville mucci@pdc.kth.se Overview On chip Hardware Performance Counters Linux Performance Counter Infrastructure

More information

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?

More information

Gigabit Ethernet Packet Capture. User s Guide

Gigabit Ethernet Packet Capture. User s Guide Gigabit Ethernet Packet Capture User s Guide Copyrights Copyright 2008 CACE Technologies, Inc. All rights reserved. This document may not, in whole or part, be: copied; photocopied; reproduced; translated;

More information

Going Linux on Massive Multicore

Going Linux on Massive Multicore Embedded Linux Conference Europe 2013 Going Linux on Massive Multicore Marta Rybczyńska 24th October, 2013 Agenda Architecture Linux Port Core Peripherals Debugging Summary and Future Plans 2 Agenda Architecture

More information

Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

CHAPTER 15: Operating Systems: An Overview

CHAPTER 15: Operating Systems: An Overview CHAPTER 15: Operating Systems: An Overview The Architecture of Computer Hardware, Systems Software & Networking: An Information Technology Approach 4th Edition, Irv Englander John Wiley and Sons 2010 PowerPoint

More information

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis White Paper Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis White Paper March 2014 2014 Cisco and/or its affiliates. All rights reserved. This document

More information

Introduction. What is an Operating System?

Introduction. What is an Operating System? Introduction What is an Operating System? 1 What is an Operating System? 2 Why is an Operating System Needed? 3 How Did They Develop? Historical Approach Affect of Architecture 4 Efficient Utilization

More information

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1 Monitors Monitor: A tool used to observe the activities on a system. Usage: A system programmer may use a monitor to improve software performance. Find frequently used segments of the software. A systems

More information

The Lagopus SDN Software Switch. 3.1 SDN and OpenFlow. 3. Cloud Computing Technology

The Lagopus SDN Software Switch. 3.1 SDN and OpenFlow. 3. Cloud Computing Technology 3. The Lagopus SDN Software Switch Here we explain the capabilities of the new Lagopus software switch in detail, starting with the basics of SDN and OpenFlow. 3.1 SDN and OpenFlow Those engaged in network-related

More information

OPTIMIZE DMA CONFIGURATION IN ENCRYPTION USE CASE. Guillène Ribière, CEO, System Architect

OPTIMIZE DMA CONFIGURATION IN ENCRYPTION USE CASE. Guillène Ribière, CEO, System Architect OPTIMIZE DMA CONFIGURATION IN ENCRYPTION USE CASE Guillène Ribière, CEO, System Architect Problem Statement Low Performances on Hardware Accelerated Encryption: Max Measured 10MBps Expectations: 90 MBps

More information

Embedded Systems: map to FPGA, GPU, CPU?

Embedded Systems: map to FPGA, GPU, CPU? Embedded Systems: map to FPGA, GPU, CPU? Jos van Eijndhoven jos@vectorfabrics.com Bits&Chips Embedded systems Nov 7, 2013 # of transistors Moore s law versus Amdahl s law Computational Capacity Hardware

More information

COMPUTER HARDWARE. Input- Output and Communication Memory Systems

COMPUTER HARDWARE. Input- Output and Communication Memory Systems COMPUTER HARDWARE Input- Output and Communication Memory Systems Computer I/O I/O devices commonly found in Computer systems Keyboards Displays Printers Magnetic Drives Compact disk read only memory (CD-ROM)

More information

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture Last Class: OS and Computer Architecture System bus Network card CPU, memory, I/O devices, network card, system bus Lecture 3, page 1 Last Class: OS and Computer Architecture OS Service Protection Interrupts

More information

Rambus Smart Data Acceleration

Rambus Smart Data Acceleration Rambus Smart Data Acceleration Back to the Future Memory and Data Access: The Final Frontier As an industry, if real progress is to be made towards the level of computing that the future mandates, then

More information

Self-monitoring Overhead of the Linux perf event Performance Counter Interface

Self-monitoring Overhead of the Linux perf event Performance Counter Interface Paper Appears in ISPASS 215, IEEE Copyright Rules Apply Self-monitoring Overhead of the Linux perf event Performance Counter Interface Vincent M. Weaver Electrical and Computer Engineering University of

More information

Operating System Impact on SMT Architecture

Operating System Impact on SMT Architecture Operating System Impact on SMT Architecture The work published in An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture, Josh Redstone et al., in Proceedings of the 9th

More information

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

More information

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture Last Class: OS and Computer Architecture System bus Network card CPU, memory, I/O devices, network card, system bus Lecture 3, page 1 Last Class: OS and Computer Architecture OS Service Protection Interrupts

More information

Freescale Semiconductor, I

Freescale Semiconductor, I nc. Application Note 6/2002 8-Bit Software Development Kit By Jiri Ryba Introduction 8-Bit SDK Overview This application note describes the features and advantages of the 8-bit SDK (software development

More information

THeME: A System for Testing by Hardware Monitoring Events

THeME: A System for Testing by Hardware Monitoring Events THeME: A System for Testing by Hardware Monitoring Events Kristen Walcott-Justice University of Virginia Charlottesville, VA USA walcott@cs.virginia.edu Jason Mars University of Virginia Charlottesville,

More information

Achieving QoS in Server Virtualization

Achieving QoS in Server Virtualization Achieving QoS in Server Virtualization Intel Platform Shared Resource Monitoring/Control in Xen Chao Peng (chao.p.peng@intel.com) 1 Increasing QoS demand in Server Virtualization Data center & Cloud infrastructure

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

AMD PhenomII. Architecture for Multimedia System -2010. Prof. Cristina Silvano. Group Member: Nazanin Vahabi 750234 Kosar Tayebani 734923

AMD PhenomII. Architecture for Multimedia System -2010. Prof. Cristina Silvano. Group Member: Nazanin Vahabi 750234 Kosar Tayebani 734923 AMD PhenomII Architecture for Multimedia System -2010 Prof. Cristina Silvano Group Member: Nazanin Vahabi 750234 Kosar Tayebani 734923 Outline Introduction Features Key architectures References AMD Phenom

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

Virtualization for Cloud Computing

Virtualization for Cloud Computing Virtualization for Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF CLOUD COMPUTING On demand provision of computational resources

More information

Eloquence Training What s new in Eloquence B.08.00

Eloquence Training What s new in Eloquence B.08.00 Eloquence Training What s new in Eloquence B.08.00 2010 Marxmeier Software AG Rev:100727 Overview Released December 2008 Supported until November 2013 Supports 32-bit and 64-bit platforms HP-UX Itanium

More information

Hardware Assisted Virtualization

Hardware Assisted Virtualization Hardware Assisted Virtualization G. Lettieri 21 Oct. 2015 1 Introduction In the hardware-assisted virtualization technique we try to execute the instructions of the target machine directly on the host

More information

Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies

Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies Kurt Klemperer, Principal System Performance Engineer kklemperer@blackboard.com Agenda Session Length:

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

perfmon2: a flexible performance monitoring interface for Linux perfmon2: une interface flexible pour l'analyse de performance sous Linux

perfmon2: a flexible performance monitoring interface for Linux perfmon2: une interface flexible pour l'analyse de performance sous Linux perfmon2: a flexible performance monitoring interface for Linux perfmon2: une interface flexible pour l'analyse de performance sous Linux Stéphane Eranian HP Labs July 2006 Ottawa Linux Symposium 2006

More information

10.04.2008. Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details

10.04.2008. Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details Thomas Fahrig Senior Developer Hypervisor Team Hypervisor Architecture Terminology Goals Basics Details Scheduling Interval External Interrupt Handling Reserves, Weights and Caps Context Switch Waiting

More information

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip Outline Modeling, simulation and optimization of Multi-Processor SoCs (MPSoCs) Università of Verona Dipartimento di Informatica MPSoCs: Multi-Processor Systems on Chip A simulation platform for a MPSoC

More information

Hardware performance monitoring. Zoltán Majó

Hardware performance monitoring. Zoltán Majó Hardware performance monitoring Zoltán Majó 1 Question Did you take any of these lectures: Computer Architecture and System Programming How to Write Fast Numerical Code Design of Parallel and High Performance

More information

Toward Accurate Performance Evaluation using Hardware Counters

Toward Accurate Performance Evaluation using Hardware Counters Toward Accurate Performance Evaluation using Hardware Counters Wiplove Mathur Jeanine Cook Klipsch School of Electrical and Computer Engineering New Mexico State University Box 3, Dept. 3-O Las Cruces,

More information

HPSA Agent Characterization

HPSA Agent Characterization HPSA Agent Characterization Product HP Server Automation (SA) Functional Area Managed Server Agent Release 9.0 Page 1 HPSA Agent Characterization Quick Links High-Level Agent Characterization Summary...

More information

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,

More information