Performance and Energy Monitoring Tools for Modern Processor Architectures


Luís Taniça
INESC-ID / IST - TU Lisbon
Rua Alves Redol 9, Lisbon, Portugal
luis.tanica@ist.utl.pt

Abstract - Accurate on-the-fly characterization of application behavior requires assessing a set of execution-related parameters at run-time, including performance, power and energy consumption. These parameters can be obtained by relying on the hardware measurement facilities built into modern multi-core architectures, such as performance and energy counters. However, current operating systems do not provide the means to directly obtain these characterization data. Thus, the user needs to rely on complex custom-built libraries with limited capabilities, which might introduce significant execution and measurement overheads. In this work, we propose two different tools for efficient performance, power and energy monitoring of systems with modern multi-core CPUs, which allow capturing the run-time behavior of a wide range of applications at different system levels: i) at the user-space level, and ii) at the kernel level, by using the OS scheduler to directly capture this information. Although the importance of the proposed monitoring facilities is evident for many purposes, we focus herein on their employment for application characterization with the Cache-aware Roofline Model. The experimental results show the capabilities of the proposed tools to deliver detailed and accurate information about the behavior of real-world applications on the underlying architectural resources. Moreover, they allow reconstructing and identifying the execution patterns of the profiled benchmarks from standard suites (SPEC CPU2006), while introducing negligible overheads.

I. INTRODUCTION

Modern computing systems are complex heterogeneous platforms capable of sustaining high computing power.
While in the past designers aimed at improving processing performance by applying power-hungry techniques, e.g., by increasing the pipeline depth and, therefore, the overall working frequency, such techniques became unbearable due to the power wall. To overcome this issue, processor manufacturers turned to multi-core designs to increase performance, typically by replicating a number of identical cores on a single die. Each core includes a set of private coherent caches and dedicated execution engines, and usually provides hardware support for multiple threads. In addition, all cores share access to a common higher-level memory organization, typically containing the last-level cache and the main memory. Although these solutions allow increasing the overall processor performance, they also introduce additional complexity into the design, making it harder for application designers to fully exploit the available processing power. For example, the contention caused by multiple cores competing for the shared resources can drastically affect the execution efficiency. In addition to these issues, current trends in computer architecture show that future processors and applications have to consider novel techniques to improve power and energy efficiency. In order to characterize and understand the behavior of such complex computational systems, accurate real-time monitoring tools are required. These allow, for example, identifying application and architectural efficiency bottlenecks in real-case scenarios, thus giving both the programmer and the computer architect hints on potential optimization targets. While many profiling tools have been developed in recent years, e.g., PAPI [] and OProfile [], it is not always easy to convert the acquired data into insightful information.
This is particularly true for modern processors, which comprise very complex architectures, including deep memory hierarchy organizations, and for which several architectural events must be analyzed. Taking into account the complexity of modern processor architectures and the effects of having different applications running concurrently on multiple cores, the Cache-aware Roofline Model (CARM) [] was proposed, which unveils architectural details that are fundamental in present-day application and architectural optimization. The CARM [] is a single-plot model that shows the practical limitations and capabilities of modern multi-core general-purpose architectures. It expresses the attainable performance of a computer architecture as an upper bound, by relating the peak floating-point performance (Flops/s), the operational intensity (Flops/byte), and the peak memory bandwidth of each cache level in the memory hierarchy (Bytes/s). The model considers the data traffic across both on-chip and off-chip memory domains, as it is perceived by the core. In this work, two different monitoring methods are proposed that combine the advantages of the recently proposed CARM with real-time, accurate monitoring facilities, in a way that allows application developers to easily relate the application behavior with the architecture characteristics, thus fostering new application optimizations. The two monitoring tools proposed herein rely on the performance and power/energy hardware interfaces and are able to extract, in real-time, important power and performance characteristics of the running application, namely: SPYMON - the user-space tool that aims at a system-wide performance analysis. The main principle behind this tool relies on spawning a process for each processor core, which handles the profiling operations for that

system's component. This tool is intended to integrate both performance and power consumption monitoring, and it is provided to the end-user via a simple-to-use interface; SCHEDMON - the tool that is mainly implemented in the kernel-space. It thus makes use of the OS internal scheduling events in order to detect context switching and to provide more accurate monitoring results. Similarly to SPYMON, this tool also provides all its functionality to the end-user through an intuitive and easy-to-use command-line interface. In addition, a run-time evaluation can be performed by means of a provided user-space library, which exports the kernel-space core functionality into user-space programs via a set of simple calls. Although there are several profiling tools available that allow obtaining performance or power consumption information, only a few provide both functionalities in a single interface. In addition, even if a full performance configuration is provided, the choice of the proper performance events to monitor is not always trivial, nor is the proper way of evaluating them in order to obtain a complete overview of the attainable application performance on the underlying architecture resources. Finally, the full performance and power consumption evaluation must be exposed to the end-user as an easy-to-use interface. However, some of the most powerful state-of-the-art performance interfaces are too complex [6] or not fully documented, which hampers their usage. The results reported in this paper illustrate the differences between the two proposed monitoring tools, and show the importance of using such monitoring techniques for understanding and characterizing application execution on modern general-purpose processors.
Overall, both monitoring methods are able to provide insightful information about the behavior of the applications and how their execution is affected by the processor architectural limitations. The remainder of this paper is organized as follows. Section II reviews the main infrastructures used by the proposed tools, and provides some insights on the state-of-the-art tools. Section III details the first proposed performance and power monitoring approach: SPYMON. Section IV details the other proposed tool: SCHEDMON. Section V presents experimental results, including comparisons between both tools and the analysis of a set of SPEC CPU2006 [9] benchmarks. Finally, Section VI presents a few concluding remarks.

II. BACKGROUND

Most modern processors contain Performance Monitoring Units (PMUs) that can be configured to count microarchitectural events such as clock cycles, retired instructions, branch mispredictions and cache misses. To count these events, a small set of Model-Specific Registers (MSRs) is provided by each architecture, which limits the total number of events that can be simultaneously measured. For example, on the Intel Sandy Bridge and Ivy Bridge architectures, the PMU facility provides two main MSR types []: Performance Monitoring Select Registers (PMSRs), which are used for configuring the events to monitor (count) in each corresponding counter; and Performance Monitoring Counters (PMCs), which hold the actual counter values. These recent architectures extend the PMCs by adding a new type of counter, the Performance Fixed Counter (PFC), which provides additional performance information without the need for any configuration. Thus, these counters can only be enabled and disabled, and monitor events predefined by the architecture. On Intel architectures, the PMU does not provide power or energy consumption measurements. In order to assess this information, the Running Average Power Limit (RAPL) interface must be used.
This allows simultaneously obtaining real-time energy consumption readings for several different domains, such as the package, cores, uncore or Dynamic Random-Access Memory (DRAM).

A. Linux Kernel Modules

Although PMU readings may be performed from user-space, configuring the PMCs must be done from a privileged level. RAPL energy status MSRs cannot be written at all, and reading them also requires special permissions. In Linux systems, there are only two different permission levels: i) the user-space, which comprises hardware privilege levels 1, 2 and 3; and ii) the kernel-space, which operates at privilege level 0. Therefore, in order to obtain the required permissions for handling the performance and/or power monitoring infrastructures, software interfaces must contain some component that runs on the kernel-space side. This can be achieved in two different ways: i) by modifying the OS kernel source, which requires recompiling the OS kernel; or ii) by using a Linux kernel module, which is a piece of code that can be integrated into the Linux kernel at run-time. The vast majority of Linux kernel modules are designated as device drivers, whether or not they are attached to a physical device []. The tools proposed herein make use of kernel modules which, although not connected to any peripheral device, may be logically seen as a way to access the physical hardware resources for performance and power/energy monitoring; it is therefore reasonable to call them drivers.

B. State-of-the-art Monitoring Tools

There are many options in the literature that provide access to the hardware performance counters. In the case of Linux, one of the earliest was the perfctr patch [] for x86 processors. Perfctr provided a low-latency memory-mapped interface to virtualized 64-bit counters on a per-process or per-thread basis. Later on, the perfmon [] interface was submitted to the kernel.
When it became apparent that perfctr would not be accepted into the Linux kernel, perfmon was rewritten and generalized as perfmon2 [] to support a wide range of processors under Linux. After a continuing effort over several years by the performance community to get perfmon2 accepted into the Linux kernel, it too was rejected and supplanted

by yet another abstraction of the hardware counters, first called perf counters in kernel 2.6.31 and then perf events [6] in kernel 2.6.32. Perf events is included in the mainline Linux kernel, which makes it the preferable choice over the other available interfaces. The interface is built around file descriptors, allocated using the newly introduced sys_perf_event_open() system call. This system call returns a file descriptor representing a virtual performance counter. Events are specified at open time by using an elaborate perf_event_attr structure, which contains dozens of fields that can interact in complex ways. PMCs are enabled or disabled via ioctl() calls and their value can be read using a call to read(). Sampling can be enabled to periodically read the counters and write the values to a circular buffer, which must be allocated using an mmap() call. Signals are sent to the process holding the referred file descriptors when new data is available. Although perf events has proven to be a quite powerful interface, it might be too complex for the common user. Moreover, it does not provide access to the RAPL interface: if one requires monitoring power alongside performance, a different interface has to be used. PAPI [] is one of the available tools that use perf events. Its objective is to be highly portable, by reusing the available Operating System (OS) performance interfaces, while allowing the inclusion of plug-ins to read other counters, such as those provided by NVIDIA Graphics Processing Units (GPUs). PAPI provides two interfaces to the underlying counter hardware: a simple high-level interface and a fully-programmable low-level interface. The high-level interface only provides functions for starting, stopping and reading the counters. The low-level interface provides much more manageability and control over the available resources. Event multiplexing, multi-thread support, user callbacks on threshold and statistical profiling are some of the available functionalities.
Recent versions of PAPI also include the possibility to measure power/energy consumption [7]. On the other hand, if deep control over the available performance resources is needed, PAPI might not be the best option, since it does not provide direct access to the performance unit but virtualizes it instead. If one is interested in quick binary profiling, without having to write code for it, Perf [] might be a preferable choice. This is a profiling Linux command-line tool and one of the most referenced. It can be seen as an abstraction over the perf events interface, much more accessible to the common user. Perf provides a set of commands which allow not only profiling but also reporting the profiling results in a user-friendly way. It provides support for multi-threaded applications, event multiplexing and statistical profiling, among others. A processor-wide mode is also available, allowing the user to profile not a single application but the system itself. However, this tool lacks the possibility of power profiling, which forces the user to search for other tools when energy information is a requirement. Yet another well-known resource is OProfile [], which is composed of a Linux kernel driver, a daemon and a perf-like command-line tool. OProfile's kernel driver is meant for abstracting the performance hardware registers and dumping the sampling information at regular intervals. The daemon can be started and stopped by the user and is responsible for consuming the profiling information provided by the kernel driver and saving it in OProfile's sampling database. This database can later be accessed by the user to extract useful profiling information by using the available command-line tools, like opreport. Although this tool appears to be complete in terms of performance, it still lacks the functionality for providing energy status information. There are several other profiling tools available, like the Intel VTune Performance Analyzer [7], LIKWID [] or LIMIT [6].
The choice of the right tool is not always trivial and depends mostly on the user's needs. For instance, one may require higher abstraction, lower overhead, finer control or more detailed information.

III. USER-SPACE MONITORING TOOL: SPYMON

SpyMon's main goal is to provide a portable tool with an intuitive interface for the end-user, without relying on the underlying OS's monitoring facilities. Hence, most of SpyMon's implementation lies in the user-space, in order not to interfere with or depend on the running system. SpyMon targets a core-oriented approach, by monitoring the behavior of each Logical Processor Core (LPC), and is therefore able to capture the information of all running applications. As a result, SpyMon allows monitoring the whole system, regardless of what is running at a given time instant on each LPC. This means that, even if an application migrates to another core, launches new threads, or its execution is constrained by the contention caused by other running applications, SpyMon is able to capture all these execution events.

A. Architecture

The proposed tool is composed of three main parts that interact in a hierarchical way, namely: i) the monitor, which controls the tool's execution flow and provides all the functionality to the user; ii) a set of spies, which are responsible for the communication with the PMU interface and for handling the performance profiling information; and iii) a Linux kernel module, which provides access to the hardware facilities, thus overcoming any privilege access restrictions. SpyMon's module, or driver, is meant for overcoming the privilege restrictions that are usually associated with the hardware monitoring interfaces [8], and is composed of: i) a small number of structures for communicating with the tool; ii) the addresses of the underlying performance and power MSRs; and iii) a set of functions that operate over these data structures and allow reading from and writing to the underlying hardware.
At the time of the module's installation, a new device file is created in the /dev directory, allowing communication between the user-space processes and the driver. The module is accessed by both the monitor and the spies via the ioctl() system call over the device file. By using this call, the tool is not only able to send

a specific command to the module, but also to specify an argument, which is used to send the proper data structures, either for holding the sample readings or for configuration purposes.

[Fig. 1. Spatial perception of SpyMon while monitoring application threads: one spy is pinned to each LPC and the monitor to the last LPC.]

The monitor process not only controls the tool's execution flow, translating its functionality to the user as a simple command-line interface, but is also responsible for monitoring power/energy consumption, when required. On the other hand, each spy is launched with the purpose of handling the performance evaluation on a single LPC. The typical SpyMon configuration is to launch a spy to monitor the performance of each available LPC and to pin the monitor to the last one, as shown in Figure 1. In the illustrated example, the monitor forks 8 new processes (spies) and pins each of them to a different LPC. By default, the monitor process is pinned to the last LPC, but different configurations are possible. The spies are responsible for handling the communication with the PMU, in order to output the obtained PMU samples. Since this work relies on facilities for monitoring energy consumption at the level of the whole chip (RAPL), the monitor is responsible for the communication with these facilities and, therefore, for reading the energy status information.

B. Execution and Implementation

When started, the tool firstly parses the input parameters (step 1). If the --help sub-command is provided, the usage information is printed to the standard output (step 8). If the --list sub-command is provided, then the complete list of available hardware events is shown (step 9).
On the other hand, if either the --start or the --roof argument is provided, then the monitoring parameters are configured according to the user's input specifications, and the application profiling is initiated. In brief, --start activates the most commonly used SpyMon profiling mode, while --roof enables run-time cache-aware roofline application monitoring. When the --start command is provided, the tool firstly parses and verifies the input parameters. Then, the monitor process is pinned to a specific LPC (step 2), by using the sched_setaffinity() system call. This call informs the scheduler on which LPCs the calling thread is allowed to execute. By default, the monitor is pinned to the last available LPC, although its affinity can be changed by the end-user in the initial tool configuration.

[Fig. 2. SpyMon's execution flow: (a) monitor execution diagram; (b) spy execution diagram.]

Afterwards, the main process forks several new processes (spies), whose number corresponds to the number of required target monitoring cores (step 3). By default, all LPCs are monitored, i.e., SpyMon firstly detects the number of available LPCs and launches one spy for each LPC. The general execution diagram for a single spy process is depicted in Figure 2b and it starts by setting up the pipe communication channel with the monitor (step a). Following the process spatial configuration, the PMU configurations are made (step 4).
After parsing the provided Performance Monitoring Event (PME) configuration, SpyMon creates a number of event-sets by grouping the events according to the number of available PMCs. For instance, if the architecture only supports 4 PMCs and the user provides 7 PMEs, then the tool will define 2 event-sets, where the first event-set contains the first 4 provided PMEs and the second one contains the remaining 3. When the PMU configuration is done, the monitor sends the configuration structures to the spies, by means of the previously established pipe communication channels. In the spy execution diagram (see Figure 2b), this corresponds to step b. From this point on, each spy starts monitoring its target LPC and producing the performance sampling information accordingly. If more than one event-set is defined, then event multiplexing is applied. In these cases, a single performance sample corresponds to a specific predefined time interval in which the performance counter readings from different event-sets are merged together. This mechanism is performed by the spies, since they are responsible for handling the PMU. When performing a system-wide evaluation, it is usually required to launch specific applications and analyze their performance, as specified by the end-user. SpyMon provides this functionality via a set of simple configuration commands,

that instruct not only launching the target applications, but also pinning their execution to the required LPCs (step 5). This is achieved by using the fork() and execve() system calls. When all the initializations and configurations are performed, SpyMon initiates profiling. At this point, the monitor process also starts reading and producing RAPL sampling information (steps 6-7), until the monitored application terminates. At the same time, each spy reads and produces PMU information samples at regular intervals (steps d-g). As soon as the performance counter readings are retrieved for the current event-set (step e), each spy activates event multiplexing if the number of event-sets is greater than one (evsets > 1), or immediately outputs the counter readings otherwise (step g). When event multiplexing is activated, the next event-set is configured (step f), i.e., a different set of events will start being counted during the next multiplexing time interval (step d). When the counts of the last event-set are retrieved, the sample is considered to be complete and its contents are output (step g). The described process (steps d-g) is repeated once per sampling time interval, until monitoring is completed. The information produced by both the monitor and the spies is directly printed out to files. The number of files corresponds to the number of processes executed inside the tool, i.e., to the number of spies plus the monitor, where each file corresponds to exactly one of those processes.

C. Usage

As already mentioned, SpyMon provides four main sub-commands: --help, --list, --start and --roof. The set of supported options can be retrieved with the --help parameter, which also provides a short summary on how to use SpyMon's interface with different options. The --list option outputs a list of the predefined hardware events provided by the tool.
The --start command allows defining the events to monitor (which can be predefined or raw events), the PFC configuration and the target cores for monitoring, as well as enabling power consumption information, defining the sampling time interval and providing target applications to be monitored. The --roof command only provides the ability to define the sampling time interval, enable power consumption information and provide the target application. The event-sets used for obtaining the predefined CARM analysis are shown in Table I. As it can be observed, 6 different events must be monitored in order to assess the number of Floating Point (FP) operations, plus 2 additional events to estimate the amount of data traffic. As a result, event multiplexing is required and the event-set configuration is made according to the information shown in Table I.

IV. SCHEDULER-BASED MONITORING TOOL: SCHEDMON

In contrast to SpyMon, SchedMon is mostly implemented in the kernel-space and aims at an application-based evaluation, thus allowing a deep profiling of the monitored applications. SchedMon's main design principle is modularity, which allows easily extending its functionality. In order to achieve this, the tool is designed not to depend on the available OS performance interfaces (e.g., perf events), i.e., it does not rely on any already implemented structure or functionality.

[Fig. 3. SchedMon's components and their disposition in the OS privilege layers: smon and SchedMon's library in user-space, SchedMon's kernel module in kernel-space, communicating via system calls and shared memory with the PMU and RAPL hardware facilities.]

A. Architecture

SchedMon is composed of two main parts, namely: i) a Linux kernel module, or driver, which implements the tool's core mechanisms; and ii) a user-space tool (smon), which exposes the whole functionality of the underlying module and translates it into a simple and intuitive user interface.
The communication between both components is made by means of a specifically developed user-space library, which provides a set of functions for handling the tool's main functionalities. Figure 3 illustrates the interaction between SchedMon's components, as well as their disposition in the OS privilege layers. As it can be observed, the Linux kernel module is responsible for interacting with the hardware, thus providing the necessary performance and power/energy consumption information via the PMU and RAPL facilities. The communication between the module and the user-space tool is made through a set of system calls over the driver's device file. The necessary communication commands are provided by the tool's user-space library. In addition to the command-related communication, a shared memory area is used for exchanging the produced profiling information at run-time.

B. SCHEDMON's Linux Kernel Module

SchedMon's kernel module, or driver, is the main component of the tool, since it provides the main functionality and holds all data structure implementations. The driver, when loaded into the kernel, creates a file in the /dev directory (the device) which acts as a communication medium to the driver, i.e., the operations over this file trigger the corresponding module function to handle each specific operation. The main operations over SchedMon's device file include: ioctl - allows configuring new PMEs, registering new tasks for monitoring and consuming profiling information from the shared memory from the user-space; mmap - allows initializing the shared memory between the kernel and the user-space, which is used for exchanging the profiling information samples;

TABLE I
SETS OF PMES USED FOR PERFORMANCE PROFILING WHEN USING THE CACHE-AWARE ROOFLINE MODEL.

FP_SSE_PACKED_SINGLE - Number of SSE single-precision FP packed µops executed.
FP_SSE_PACKED_DOUBLE - Number of SSE double-precision FP packed µops executed.
FP_AVX_PACKED_SINGLE - Number of AVX 256-bit packed single-precision FP instructions executed.
FP_AVX_PACKED_DOUBLE - Number of AVX 256-bit packed double-precision FP instructions executed.
FP_SSE_SCALAR_SINGLE - Number of SSE single-precision FP scalar µops executed.
FP_SSE_SCALAR_DOUBLE - Number of SSE double-precision FP scalar µops executed.
MEM_UOP_RETIRED_ALL_LOADS - Qualify any retired memory µops that are loads.
MEM_UOP_RETIRED_ALL_STORES - Qualify any retired memory µops that are stores.

SSE - Streaming SIMD Extensions; FP - floating-point; AVX - Advanced Vector Extensions; µops - micro-operations.

poll - implements the synchronization mechanisms used by SchedMon in order to coordinate the read and write operations over the previously allocated shared memory. By relying on the above described calls, full control and configuration of SchedMon's driver from the user-space can be attained. The following text describes the main structures and mechanisms implemented by SchedMon in order to achieve its full functionality.

1) Events, Event-sets and Environment: SchedMon keeps all the profiling configurations inside the driver. The three main structures used for keeping track of the registered performance configurations are: events - the structure that holds a specific PME configuration; event-sets - structures used to aggregate a number of events into several groups that can be configured into the PMU; environment - a structure that is temporarily created at the time of the monitoring execution and contains all the profiling configurations for that specific run, e.g., the required profiling information, sampling time interval and event-sets to monitor.
This structure hierarchy allows not only reusing the same event configurations across different event-sets, but also reusing the same event-sets across distinct runs.

2) Monitored Tasks: SchedMon defines two types of tasks: leaders and children. In order to monitor an application using the tool, the target process, or thread, must be registered into the driver. For this, an ioctl() system call with the proper request must be performed. The task registration request requires two distinct arguments: the target PID, which is the task identification parameter, and an environment data structure containing the profiling configuration. Under SchedMon's driver, every registered task is appointed as a leader, while a child corresponds to a task descending from a leader. This only applies if the inherit option is enabled upon the leader task registration; otherwise, the driver will not register any children descending from that task. Each leader task registered in the driver is associated with a performance environment, i.e., a data structure containing the profiling execution configuration. Whenever a child is allocated by the driver, it inherits its leader's performance environment and, therefore, the same configuration.

3) Scheduling Infrastructures: SchedMon's driver makes use of the OS scheduler tracepoints in order to attain full control over the monitored tasks' execution, and it is driven by the following OS scheduling events: sched_process_exec() - this event is triggered whenever a task performs an execve() system call, and it is used by SchedMon to start monitoring a target task from the exact beginning of its execution; sched_process_fork() - this event is triggered whenever a task forks another task, and it is used by the tool's driver to enable monitoring of multi-threaded applications; sched_switch() - this event is triggered on a specific LPC whenever the Linux scheduler replaces the currently executing task with another one.
SchedMon uses this event to detect when a monitored task is scheduled in or scheduled out of a specific LPC, in order to perform the necessary PMU context switch; sched_migrate_task() - this event is triggered whenever a task migrates from one LPC to another, and it is used by SchedMon to facilitate the search for a monitored task; sched_process_exit() - this event is triggered whenever a task terminates its execution, and it is used to detect the termination of monitored tasks. 4) Sample Types: SchedMon's driver currently provides five different types of samples, which can be enabled through the environment structure at the time of the registration of a target task: PMU - these samples contain the performance information and are always enabled by default; RAPL - these samples contain the energy status information obtained from the RAPL interface; migration - a migration sample is provided each time a monitored task migrates to a different LPC; fork - a fork sample is provided whenever a monitored multi-threaded application forks/creates a new task; scheduling - scheduling samples contain the information on when the task was scheduled in and scheduled out of a specific LPC by the Linux scheduler. 5) Sampling: Sampling refers to the process of extracting specific information from the execution at regular time intervals. In order to provide accurate performance sampling, several auxiliary data structures are used for this process.

The main data structures used for performance sampling are: i) the array containing the different event-set configurations; ii) a Linux high-resolution timer, for synchronization purposes and sampling at the nanosecond granularity; and iii) a temporary PMU sample, which holds the current sample counts. Thus, each time a monitored task is executed on a specific LPC, the corresponding event-set is configured into the PMU and the timer is set to trigger after the current sample's remaining time. When the timer fires, the PMU information is obtained and a performance sample is produced. When RAPL sampling is enabled, a distinct high-resolution timer is set to trigger at regular time intervals. Each time this timer fires, the energy status information is obtained and provided to the user-space as a RAPL sample. 6) Kernel-User Communication: As previously mentioned, the profiling information is exchanged between the kernel and user-space by means of shared memory. The mechanism used by SchedMon to exchange profiling information with the user-space comprises: i) a memory ring-buffer; ii) a virtual memory area, shared between the kernel and the user-space; and iii) a synchronization procedure. The virtual memory area is allocated by the user-space process (smon) by means of the mmap() system call, before starting the profiling. When SchedMon's driver detects an mmap() call, it creates a shared memory ring-buffer, which operates as the communication medium between the driver and smon. The synchronization is performed by means of the poll() system call. Hence, whenever the required amount of samples is available for consumption, the driver signals the user-space process. The user-space process can then access the corresponding sampling information and, after that, it alerts the driver by using the ioctl() system call. C.
User-space Tool The SchedMon user-space component, smon, is integrated in the tool in order to facilitate the access to and handling of the underlying driver. By relying on the driver's user-space library for configuration and on the mmap() and poll() system calls, smon translates the whole tool's functionality into an easy-to-use command-line interface. The main functionalities of smon include: i) the creation of events; ii) the definition of event-sets, by using the already created events; and iii) the ability to profile an application. Function call tracing is the process of detecting whenever a target application (the tracee) enters or leaves a function call. This is an important feature for detecting potential execution bottlenecks in the most time-consuming parts of the application. The method used by smon to detect the entry and return points of a function requires preprocessing the dumped assembly code of the application. The detected execution points are then assigned to breakpoint structures, which hold the original bytes contained in those positions and are used to inject code at the same memory addresses. SchedMon is able to trace the target task's function calls by resorting to the ptrace() system call and by injecting a trap instruction at each breakpoint. As a result, the tool is able to track whenever a breakpoint is hit. Smon is also able to detect whenever a new process is forked or switches its execution image, thus allowing call tracing of multi-threaded applications, even when different tasks execute distinct binaries. D. Usage SchedMon provides an interface similar to SpyMon's, which translates the full functionality of the tool to the end-user in a simple and intuitive way. The interface is composed of four main commands: smon-event - this command allows adding new PME configurations to the tool; smon-evset - this command allows adding new event-sets to the tool.
Each event-set must be composed of a number of already defined events; profile - this command allows profiling a given target application. In addition, it provides several options for configuring distinct profiling parameters (e.g., sampling time interval, the required sample types, the shared-memory size and the event-sets to be monitored); smon-roof-run - this command allows performing a predefined performance evaluation of the target application according to the CARM. In order to achieve this, SchedMon relies on the same predefined event-sets as SpyMon (see Table I). V. EVALUATION RESULTS A. Experimental Environment The herein presented results were obtained on a machine containing an Intel i7 3770K processor, which is based on the Ivy Bridge micro-architecture, with 4 physical cores and hyper-threading support, i.e., 8 LPCs. It operates at 3.5GHz, although it can attain 3.9GHz in turbo boost mode, and its memory organization comprises three cache levels of 32kB, 256kB and 8192kB, respectively. The cache levels L1 and L2 are shared between the LPCs contained in the same PPC, and the last-level cache, L3, is shared between all the cores. The DRAM memory controllers support up to two channels of DDR3. For executing the following tests, the processor's clock was set to continuously run at a fixed frequency of 3.5GHz. Moreover, the machine provides a PMU containing fixed (PFC) and programmable (PMC) counters, as well as a RAPL interface. B. SpyMon 1) System-wide Analysis: Figure illustrates a performance evaluation of four distinct SPEC CPU2006 benchmarks (milc, namd, GemsFDTD and tonto). In order to obtain the depicted results, each benchmark test was executed individually, without the interference of any other applications (with the exception of the OS tasks). For each execution, the benchmark process was pinned to its corresponding LPC, as shown in Figure. Each of the shown LPCs was chosen so as to belong to a distinct Physical Processor Core (PPC).
After running each of the four tests individually, a final run was performed, in which all four tests were run at the same time.

[Figure: SpyMon performance evaluation of SPEC CPU2006 benchmarks, for a ms sampling time interval. Panels (a)-(h) show milc, namd, GemsFDTD and tonto, each running alone and alongside the others, on distinct cores.]

The obtained results are presented in Figures b, d, f and h. In each of the runs, the sampling time interval was set to ms. By analyzing the figure, several details can be extracted. First, all the benchmarks achieve lower performance when run alongside each other, due to shared resource contention. This conclusion can be drawn by observing that i) each benchmark's duration is longer when run alongside the others, and ii) each benchmark's performance values are significantly lower. Another interesting observation concerns the shapes of the obtained plots, where different parts of the execution can be detected. For instance, when running the milc benchmark alone (see Figure a), at least three distinct execution phases can be identified, each occurring at regular time intervals and delivering a different attainable performance (in GFlops/s). However, when run together, the shapes of each benchmark execution appear to change according to the concurrent applications. For example, the shape of the GemsFDTD benchmark is completely distorted when run with the other applications (see Figures e and f). Figure depicts the experimentally obtained power consumption for the above described test conditions. The plotted information corresponds to the package domain, i.e., it represents the power consumption of the whole chip.

[Figure: Power consumption of the four benchmarks (milc, namd, GemsFDTD, tonto) run separately (a)-(d) and simultaneously (e).]

When each benchmark is executed alone (see Figures a, b, c and d), the chip power consumption is around W. As can be observed, the power consumption depends not only on whether a core is activated, but also on the resource utilization. For instance, as shown in Figure a, the power consumption assumes a shape similar to the one observed in the milc performance profile. On the other hand, Figure e shows the power consumption when all benchmarks were simultaneously executed. As can be observed, each additional activated LPC corresponds to an increment of approximately W in the system's power consumption. 2) CARM and Power Evaluation: Figure 6 illustrates the CARM and power consumption evaluation of the tonto benchmark.

[Figure 6: SpyMon evaluation of the SPEC CPU2006 benchmark tonto, for a ms sampling time interval; (a) CARM analysis and (b) power analysis according to the FP types (DBL, SSE, AVX), against the DBL (ADD,MUL), SSE (ADD,MUL) / DBL (MAD) and AVX (ADD,MUL) / SSE (MAD) rooflines, over operational intensity [flops/bytes].]

Figure 6a contains the CARM information for tonto. This test presents two distinct performance parts, corresponding to the predominant scalar and Streaming SIMD Extensions (SSE) FP types, respectively. As can be observed in Figure 6b, the two distinct parts of the execution interchangeably switch over time. During the parts of the execution corresponding to the scalar instructions (DBL), one can conclude that tonto is mainly memory-bound, since it lies to the left of the corresponding ridge point, both for the ADD/MUL and MAD rooflines. In fact, Figure shows that these zones of the execution are memory dependent and inflict changes in the performance shapes of applications running alongside. On the other hand, when executing SSE instructions, it is considered to be more compute-bound. C. SchedMon 1) Multi-threaded Scheduling Information: SchedMon not only allows detecting and monitoring multi-threaded applications, but also provides the means to analyze the scheduling route of each task's execution. This allows obtaining more detailed information on the system's scheduling mechanisms, as well as extracting useful insights about the application's structure.

[Figure 7: Scheduling information for the OpenCL application fdtd.] [Figure 8: Milc performance in time, colored according to its function call tracing profile (grsource_imp(), imp_gauge_force(), ks_congrad(), eo_fermion_force()).]

Figure 7 shows the scheduling information corresponding to an FDTD OpenCL [] application execution. As can be seen, SchedMon is capable of monitoring all the information regarding when each of the application's tasks enters or leaves a Central Processing Unit (CPU) (LPC). Since the underlying hardware contains 8 LPCs and the tested application is composed of 9 tasks, it is not possible to run all the tasks at the same time on all LPCs.
In this specific test, the OS scheduler solves this issue by constantly migrating task 79 from one core to another. For example, at around ms of the execution time, task 79 is migrated from LPC to LPC 6. Another interesting phenomenon can be observed at around 9 seconds of the execution, where all tasks stop executing for about one second, with the sole exception of the leader task. This indicates that all the tasks are waiting either for resources or for instructions from the main thread (786), thus showing the capability of SchedMon to provide insights on the application structure. 2) Function Call Tracing: Figure 8 depicts the performance analysis of milc in time, and colors the samples according to the high-level function that is currently being executed.

[Figure 9: SchedMon evaluation of the SPEC CPU2006 benchmark tonto, for a ms sampling time interval; (a) CARM analysis and (b) power analysis according to the call tracing profile (add_constraint(), make_constraint_data(), make_fock_matrix()).]

As previously observed, milc presents several distinct phases, and hence it was a preferred benchmark for this particular demonstration. As can be observed in Figure 8, it is possible to extract a pattern from the milc execution, where each distinct performance phase corresponds to a different high-level function. This allows not only evaluating each execution part of a given application, but also detecting possible performance bottlenecks. 3) CARM and Power Evaluation: In order to obtain a different perspective from the already obtained CARM and power consumption profiles for tonto, the information samples are now colored according to their function call tracing.
In Figure 9a, it can be observed that tonto's high-level functions correspond to distinct CARM phases. The functions add_constraint() and make_constraint_data() present a similar behavior and attain a higher performance, whilst make_fock_matrix() presents a lower performance and is contained in the memory-bound area. In contrast to Figure 6a, additional information can be obtained by analyzing the execution call tracing profile, which allows detecting and further optimizing the possible bottlenecks of the application execution. Thus, tonto contains observable phases both in time and in the CARM. In comparison to SpyMon, it can be observed that the power consumption is reduced when using SchedMon as the monitoring tool. This can be explained by the fact that SchedMon does not create additional tasks for monitoring, i.e., it makes use of the tasks already running in the system in order to periodically read the energy status information. On the other hand, SpyMon is composed of 9 processes (monitor plus spies), which are actively monitoring the different LPCs at run-time, including those that are not currently running any application.

[Figure: Overhead of taking a PMU sample (a) and a RAPL sample (b), in µs, for SpyMon and SchedMon, across sampling time intervals (ms).] [Figure: Number of instructions per sample (LD, ST, OT) when self-monitoring, for (a) SpyMon and (b) SchedMon, across sampling time intervals (ms).]

D. Overhead Discussion In order to perform a fair overhead analysis, both tools were instrumented with the rdtsc instruction and run under similar conditions, in order to obtain the median overhead of taking a PMU and a RAPL sample. Figure illustrates the obtained results for both tools: Figure a shows the overhead of taking a PMU sample, while Figure b shows the overhead of taking a RAPL sample. As can be observed, SchedMon presents a lower overhead in both cases. The overhead of producing a PMU sample is around .6µs in SchedMon, compared to around .9µs in SpyMon, while the overhead of producing a RAPL sample is around .µs for SchedMon, compared to around .µs in SpyMon. In addition, another evaluation test was performed, in which each of the tools was run without the execution of any benchmarks. Figures a and b illustrate the number of instruction counts per sample obtained by SpyMon and SchedMon, respectively. As can be observed, SchedMon imposes a lower overhead in terms of the number of instructions per sample than SpyMon. Nonetheless, it is important to notice that these results account for all the instructions executed during the tools' execution, which might include additional counts from the OS. VI.
CONCLUSION In this work, two new tools for accurate application monitoring and characterization are proposed, which extract run-time information at different OS levels, namely: SPYMON at the user-space level and SCHEDMON at the kernel-space level (OS scheduler). These tools combine the hardware measurement facilities available in modern multi-core architectures with the Cache-aware Roofline Model. This allows run-time characterization of application execution in terms of both performance and power/energy consumption, and allows extracting important guidelines for application optimization. The experimental results presented in this paper show that both SPYMON and SCHEDMON provide accurate performance characterization of real-world applications. However, core-oriented characterization with SPYMON may result in an increased power consumption when monitoring cores that are not used by any of the running applications. On the other hand, SCHEDMON provides multi-threaded application evaluation with very low overheads. Despite these differences, both monitoring methods allow the user/programmer to get a detailed picture of the behavior of the application and of how its execution is affected by the processor's architectural features.

ACKNOWLEDGMENT This work was supported by national funds through FCT - Fundação para a Ciência e a Tecnologia, under the project PHSC - Stretching the Limits of Parallel Processing on Heterogeneous Computing Systems, reference PTDC/EEI-ELC//.

REFERENCES
[1] Perf Wiki tutorial on perf. Accessed: -6-.
[2] Perfmon SourceForge project page. Accessed: -6-.
[3] Shirley Browne, Jack Dongarra, Nathan Garner, George Ho, and Philip Mucci. A portable programming interface for performance evaluation on modern processors. International Journal of High Performance Computing Applications, 2000.
[4] William E. Cohen. Tuning programs with OProfile. Wide Open Magazine.
[5] Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. Linux Device Drivers. O'Reilly Media, Inc.
[6] John Demme and Simha Sethumadhavan. Rapid identification of architectural bottlenecks via precise event counting. In ACM SIGARCH Computer Architecture News, volume 39. ACM, 2011.
[7] Jack Donnell. Java performance profiling using the VTune Performance Analyzer.
[8] Agner Fog. Software optimization resources. Accessed: --.
[9] John L. Henning. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 2006.
[10] Aleksandar Ilic, Frederico Pratas, and Leonel Sousa. Cache-aware Roofline model: Upgrading the loft. IEEE Computer Architecture Letters, 2014.
[11] Intel. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3: System Programming Guide.
[12] Sverre Jarp, Ryszard Jurga, and Andrzej Nowak. Perfmon2: A leap forward in performance monitoring. In Journal of Physics: Conference Series. IOP Publishing, 2008.
[13] Lidia Kuan, Pedro Tomas, and Leonel Sousa. A comparison of computing architectures and parallelization frameworks based on a two-dimensional FDTD. In International Conference on High Performance Computing and Simulation (HPCS). IEEE.
[14] Mikael Pettersson. Perfctr: Linux performance monitoring counters driver.
[15] Jan Treibig, Georg Hager, and Gerhard Wellein. LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. In Parallel Processing Workshops (ICPPW). IEEE, 2010.
[16] Vincent M. Weaver. Linux perf_event features and overhead. In The 2nd International Workshop on Performance Analysis of Workload Optimized Systems (FastPath), 2013.
[17] Vincent M. Weaver, Matt Johnson, Kiran Kasichayanula, James Ralph, Piotr Luszczek, Daniel Terpstra, and Shirley Moore. Measuring energy and power with PAPI. In Parallel Processing Workshops (ICPPW). IEEE, 2012.


More information

Operating Systems. 05. Threads. Paul Krzyzanowski. Rutgers University. Spring 2015

Operating Systems. 05. Threads. Paul Krzyzanowski. Rutgers University. Spring 2015 Operating Systems 05. Threads Paul Krzyzanowski Rutgers University Spring 2015 February 9, 2015 2014-2015 Paul Krzyzanowski 1 Thread of execution Single sequence of instructions Pointed to by the program

More information

Perfmon2: a flexible performance monitoring interface for Linux

Perfmon2: a flexible performance monitoring interface for Linux Perfmon2: a flexible performance monitoring interface for Linux Stéphane Eranian HP Labs eranian@hpl.hp.com Abstract Monitoring program execution is becoming more than ever key to achieving world-class

More information

White Paper. Real-time Capabilities for Linux SGI REACT Real-Time for Linux

White Paper. Real-time Capabilities for Linux SGI REACT Real-Time for Linux White Paper Real-time Capabilities for Linux SGI REACT Real-Time for Linux Abstract This white paper describes the real-time capabilities provided by SGI REACT Real-Time for Linux. software. REACT enables

More information

OpenSPARC T1 Processor

OpenSPARC T1 Processor OpenSPARC T1 Processor The OpenSPARC T1 processor is the first chip multiprocessor that fully implements the Sun Throughput Computing Initiative. Each of the eight SPARC processor cores has full hardware

More information

Security Overview of the Integrity Virtual Machines Architecture

Security Overview of the Integrity Virtual Machines Architecture Security Overview of the Integrity Virtual Machines Architecture Introduction... 2 Integrity Virtual Machines Architecture... 2 Virtual Machine Host System... 2 Virtual Machine Control... 2 Scheduling

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

D1.2 Network Load Balancing

D1.2 Network Load Balancing D1. Network Load Balancing Ronald van der Pol, Freek Dijkstra, Igor Idziejczak, and Mark Meijerink SARA Computing and Networking Services, Science Park 11, 9 XG Amsterdam, The Netherlands June ronald.vanderpol@sara.nl,freek.dijkstra@sara.nl,

More information

Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual

Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual Overview Metrics Monitor is part of Intel Media Server Studio 2015 for Linux Server. Metrics Monitor is a user space shared library

More information

Xeon+FPGA Platform for the Data Center

Xeon+FPGA Platform for the Data Center Xeon+FPGA Platform for the Data Center ISCA/CARL 2015 PK Gupta, Director of Cloud Platform Technology, DCG/CPG Overview Data Center and Workloads Xeon+FPGA Accelerator Platform Applications and Eco-system

More information

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get

More information

Chapter 3 Operating-System Structures

Chapter 3 Operating-System Structures Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual

More information

Sequential Performance Analysis with Callgrind and KCachegrind

Sequential Performance Analysis with Callgrind and KCachegrind Sequential Performance Analysis with Callgrind and KCachegrind 4 th Parallel Tools Workshop, HLRS, Stuttgart, September 7/8, 2010 Josef Weidendorfer Lehrstuhl für Rechnertechnik und Rechnerorganisation

More information

Exascale Challenges and General Purpose Processors. Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation

Exascale Challenges and General Purpose Processors. Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation Exascale Challenges and General Purpose Processors Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation Jun-93 Aug-94 Oct-95 Dec-96 Feb-98 Apr-99 Jun-00 Aug-01 Oct-02 Dec-03

More information

theguard! ApplicationManager System Windows Data Collector

theguard! ApplicationManager System Windows Data Collector theguard! ApplicationManager System Windows Data Collector Status: 10/9/2008 Introduction... 3 The Performance Features of the ApplicationManager Data Collector for Microsoft Windows Server... 3 Overview

More information

Sequential Performance Analysis with Callgrind and KCachegrind

Sequential Performance Analysis with Callgrind and KCachegrind Sequential Performance Analysis with Callgrind and KCachegrind 2 nd Parallel Tools Workshop, HLRS, Stuttgart, July 7/8, 2008 Josef Weidendorfer Lehrstuhl für Rechnertechnik und Rechnerorganisation Institut

More information

Using a Generic Plug and Play Performance Monitor for SoC Verification

Using a Generic Plug and Play Performance Monitor for SoC Verification Using a Generic Plug and Play Performance Monitor for SoC Verification Dr. Ambar Sarkar Kaushal Modi Janak Patel Bhavin Patel Ajay Tiwari Accellera Systems Initiative 1 Agenda Introduction Challenges Why

More information

Technical Paper. Moving SAS Applications from a Physical to a Virtual VMware Environment

Technical Paper. Moving SAS Applications from a Physical to a Virtual VMware Environment Technical Paper Moving SAS Applications from a Physical to a Virtual VMware Environment Release Information Content Version: April 2015. Trademarks and Patents SAS Institute Inc., SAS Campus Drive, Cary,

More information

Best Practises for LabVIEW FPGA Design Flow. uk.ni.com ireland.ni.com

Best Practises for LabVIEW FPGA Design Flow. uk.ni.com ireland.ni.com Best Practises for LabVIEW FPGA Design Flow 1 Agenda Overall Application Design Flow Host, Real-Time and FPGA LabVIEW FPGA Architecture Development FPGA Design Flow Common FPGA Architectures Testing and

More information

Multi-core and Linux* Kernel

Multi-core and Linux* Kernel Multi-core and Linux* Kernel Suresh Siddha Intel Open Source Technology Center Abstract Semiconductor technological advances in the recent years have led to the inclusion of multiple CPU execution cores

More information

Operating Systems 4 th Class

Operating Systems 4 th Class Operating Systems 4 th Class Lecture 1 Operating Systems Operating systems are essential part of any computer system. Therefore, a course in operating systems is an essential part of any computer science

More information

serious tools for serious apps

serious tools for serious apps 524028-2 Label.indd 1 serious tools for serious apps Real-Time Debugging Real-Time Linux Debugging and Analysis Tools Deterministic multi-core debugging, monitoring, tracing and scheduling Ideal for time-critical

More information

D5.6 Prototype demonstration of performance monitoring tools on a system with multiple ARM boards Version 1.0

D5.6 Prototype demonstration of performance monitoring tools on a system with multiple ARM boards Version 1.0 D5.6 Prototype demonstration of performance monitoring tools on a system with multiple ARM boards Document Information Contract Number 288777 Project Website www.montblanc-project.eu Contractual Deadline

More information

Eight Ways to Increase GPIB System Performance

Eight Ways to Increase GPIB System Performance Application Note 133 Eight Ways to Increase GPIB System Performance Amar Patel Introduction When building an automated measurement system, you can never have too much performance. Increasing performance

More information

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General

More information

A Comparative Study on Vega-HTTP & Popular Open-source Web-servers

A Comparative Study on Vega-HTTP & Popular Open-source Web-servers A Comparative Study on Vega-HTTP & Popular Open-source Web-servers Happiest People. Happiest Customers Contents Abstract... 3 Introduction... 3 Performance Comparison... 4 Architecture... 5 Diagram...

More information

A Brief Survery of Linux Performance Engineering. Philip J. Mucci University of Tennessee, Knoxville mucci@pdc.kth.se

A Brief Survery of Linux Performance Engineering. Philip J. Mucci University of Tennessee, Knoxville mucci@pdc.kth.se A Brief Survery of Linux Performance Engineering Philip J. Mucci University of Tennessee, Knoxville mucci@pdc.kth.se Overview On chip Hardware Performance Counters Linux Performance Counter Infrastructure

More information

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?

More information

Gigabit Ethernet Packet Capture. User s Guide

Gigabit Ethernet Packet Capture. User s Guide Gigabit Ethernet Packet Capture User s Guide Copyrights Copyright 2008 CACE Technologies, Inc. All rights reserved. This document may not, in whole or part, be: copied; photocopied; reproduced; translated;

More information

Going Linux on Massive Multicore

Going Linux on Massive Multicore Embedded Linux Conference Europe 2013 Going Linux on Massive Multicore Marta Rybczyńska 24th October, 2013 Agenda Architecture Linux Port Core Peripherals Debugging Summary and Future Plans 2 Agenda Architecture

More information

Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

CHAPTER 15: Operating Systems: An Overview

CHAPTER 15: Operating Systems: An Overview CHAPTER 15: Operating Systems: An Overview The Architecture of Computer Hardware, Systems Software & Networking: An Information Technology Approach 4th Edition, Irv Englander John Wiley and Sons 2010 PowerPoint

More information

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis White Paper Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis White Paper March 2014 2014 Cisco and/or its affiliates. All rights reserved. This document

More information

Introduction. What is an Operating System?

Introduction. What is an Operating System? Introduction What is an Operating System? 1 What is an Operating System? 2 Why is an Operating System Needed? 3 How Did They Develop? Historical Approach Affect of Architecture 4 Efficient Utilization

More information

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1 Monitors Monitor: A tool used to observe the activities on a system. Usage: A system programmer may use a monitor to improve software performance. Find frequently used segments of the software. A systems

More information

The Lagopus SDN Software Switch. 3.1 SDN and OpenFlow. 3. Cloud Computing Technology

The Lagopus SDN Software Switch. 3.1 SDN and OpenFlow. 3. Cloud Computing Technology 3. The Lagopus SDN Software Switch Here we explain the capabilities of the new Lagopus software switch in detail, starting with the basics of SDN and OpenFlow. 3.1 SDN and OpenFlow Those engaged in network-related

More information

OPTIMIZE DMA CONFIGURATION IN ENCRYPTION USE CASE. Guillène Ribière, CEO, System Architect

OPTIMIZE DMA CONFIGURATION IN ENCRYPTION USE CASE. Guillène Ribière, CEO, System Architect OPTIMIZE DMA CONFIGURATION IN ENCRYPTION USE CASE Guillène Ribière, CEO, System Architect Problem Statement Low Performances on Hardware Accelerated Encryption: Max Measured 10MBps Expectations: 90 MBps

More information

Embedded Systems: map to FPGA, GPU, CPU?

Embedded Systems: map to FPGA, GPU, CPU? Embedded Systems: map to FPGA, GPU, CPU? Jos van Eijndhoven jos@vectorfabrics.com Bits&Chips Embedded systems Nov 7, 2013 # of transistors Moore s law versus Amdahl s law Computational Capacity Hardware

More information

COMPUTER HARDWARE. Input- Output and Communication Memory Systems

COMPUTER HARDWARE. Input- Output and Communication Memory Systems COMPUTER HARDWARE Input- Output and Communication Memory Systems Computer I/O I/O devices commonly found in Computer systems Keyboards Displays Printers Magnetic Drives Compact disk read only memory (CD-ROM)

More information

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture Last Class: OS and Computer Architecture System bus Network card CPU, memory, I/O devices, network card, system bus Lecture 3, page 1 Last Class: OS and Computer Architecture OS Service Protection Interrupts

More information

Rambus Smart Data Acceleration

Rambus Smart Data Acceleration Rambus Smart Data Acceleration Back to the Future Memory and Data Access: The Final Frontier As an industry, if real progress is to be made towards the level of computing that the future mandates, then

More information

Self-monitoring Overhead of the Linux perf event Performance Counter Interface

Self-monitoring Overhead of the Linux perf event Performance Counter Interface Paper Appears in ISPASS 215, IEEE Copyright Rules Apply Self-monitoring Overhead of the Linux perf event Performance Counter Interface Vincent M. Weaver Electrical and Computer Engineering University of

More information

Operating System Impact on SMT Architecture

Operating System Impact on SMT Architecture Operating System Impact on SMT Architecture The work published in An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture, Josh Redstone et al., in Proceedings of the 9th

More information

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

More information

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture Last Class: OS and Computer Architecture System bus Network card CPU, memory, I/O devices, network card, system bus Lecture 3, page 1 Last Class: OS and Computer Architecture OS Service Protection Interrupts

More information

Freescale Semiconductor, I

Freescale Semiconductor, I nc. Application Note 6/2002 8-Bit Software Development Kit By Jiri Ryba Introduction 8-Bit SDK Overview This application note describes the features and advantages of the 8-bit SDK (software development

More information

THeME: A System for Testing by Hardware Monitoring Events

THeME: A System for Testing by Hardware Monitoring Events THeME: A System for Testing by Hardware Monitoring Events Kristen Walcott-Justice University of Virginia Charlottesville, VA USA walcott@cs.virginia.edu Jason Mars University of Virginia Charlottesville,

More information

Achieving QoS in Server Virtualization

Achieving QoS in Server Virtualization Achieving QoS in Server Virtualization Intel Platform Shared Resource Monitoring/Control in Xen Chao Peng (chao.p.peng@intel.com) 1 Increasing QoS demand in Server Virtualization Data center & Cloud infrastructure

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

AMD PhenomII. Architecture for Multimedia System -2010. Prof. Cristina Silvano. Group Member: Nazanin Vahabi 750234 Kosar Tayebani 734923

AMD PhenomII. Architecture for Multimedia System -2010. Prof. Cristina Silvano. Group Member: Nazanin Vahabi 750234 Kosar Tayebani 734923 AMD PhenomII Architecture for Multimedia System -2010 Prof. Cristina Silvano Group Member: Nazanin Vahabi 750234 Kosar Tayebani 734923 Outline Introduction Features Key architectures References AMD Phenom

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

Virtualization for Cloud Computing

Virtualization for Cloud Computing Virtualization for Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF CLOUD COMPUTING On demand provision of computational resources

More information

Eloquence Training What s new in Eloquence B.08.00

Eloquence Training What s new in Eloquence B.08.00 Eloquence Training What s new in Eloquence B.08.00 2010 Marxmeier Software AG Rev:100727 Overview Released December 2008 Supported until November 2013 Supports 32-bit and 64-bit platforms HP-UX Itanium

More information

Hardware Assisted Virtualization

Hardware Assisted Virtualization Hardware Assisted Virtualization G. Lettieri 21 Oct. 2015 1 Introduction In the hardware-assisted virtualization technique we try to execute the instructions of the target machine directly on the host

More information

Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies

Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies Kurt Klemperer, Principal System Performance Engineer kklemperer@blackboard.com Agenda Session Length:

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

perfmon2: a flexible performance monitoring interface for Linux perfmon2: une interface flexible pour l'analyse de performance sous Linux

perfmon2: a flexible performance monitoring interface for Linux perfmon2: une interface flexible pour l'analyse de performance sous Linux perfmon2: a flexible performance monitoring interface for Linux perfmon2: une interface flexible pour l'analyse de performance sous Linux Stéphane Eranian HP Labs July 2006 Ottawa Linux Symposium 2006

More information

10.04.2008. Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details

10.04.2008. Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details Thomas Fahrig Senior Developer Hypervisor Team Hypervisor Architecture Terminology Goals Basics Details Scheduling Interval External Interrupt Handling Reserves, Weights and Caps Context Switch Waiting

More information

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip Outline Modeling, simulation and optimization of Multi-Processor SoCs (MPSoCs) Università of Verona Dipartimento di Informatica MPSoCs: Multi-Processor Systems on Chip A simulation platform for a MPSoC

More information

Hardware performance monitoring. Zoltán Majó

Hardware performance monitoring. Zoltán Majó Hardware performance monitoring Zoltán Majó 1 Question Did you take any of these lectures: Computer Architecture and System Programming How to Write Fast Numerical Code Design of Parallel and High Performance

More information

Toward Accurate Performance Evaluation using Hardware Counters

Toward Accurate Performance Evaluation using Hardware Counters Toward Accurate Performance Evaluation using Hardware Counters Wiplove Mathur Jeanine Cook Klipsch School of Electrical and Computer Engineering New Mexico State University Box 3, Dept. 3-O Las Cruces,

More information

HPSA Agent Characterization

HPSA Agent Characterization HPSA Agent Characterization Product HP Server Automation (SA) Functional Area Managed Server Agent Release 9.0 Page 1 HPSA Agent Characterization Quick Links High-Level Agent Characterization Summary...

More information

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,

More information