Self-monitoring Overhead of the Linux perf event Performance Counter Interface
Paper Appears in ISPASS 2015, IEEE Copyright Rules Apply

Self-monitoring Overhead of the Linux perf event Performance Counter Interface

Vincent M. Weaver
Electrical and Computer Engineering
University of Maine

Abstract—Most modern CPUs include hardware performance counters: architectural registers that allow programmers to gain low-level insight into system performance. Low-overhead access to these counters is necessary for accurate performance analysis, making the operating system interface critical to providing low-latency performance data. We investigate the overhead of self-monitoring performance counter measurements on the Linux perf event interface. We find that default code (such as that used by PAPI) implementing the perf event self-monitoring interface can have large overhead: up to an order of magnitude larger than the previously used perfctr and perfmon2 performance counter implementations. We investigate the causes of this overhead and find that with proper coding it can be greatly reduced on recent Linux kernels.

I. INTRODUCTION

Most modern CPUs contain hardware performance counters: architectural registers that allow low-level analysis of running programs. Low-overhead access to these counters (both in time and in CPU resources) is necessary for useful and accurate performance analysis. Raw access to the underlying counters typically requires supervisor-level permissions; this access is usually managed by the operating system. The operating system thus enters the critical path of low-overhead counter access.

The Linux operating system kernel is widely used in situations where performance analysis is critical. This includes High Performance Computing (HPC), where as of June 2014, 97% of the Top500 supercomputers in the world run some form of Linux [1]. At the other extreme are embedded systems, where performance optimization enables the maximum use of limited hardware resources.
Linux is key here too, as Android phones (which run Linux kernels) made up 81% of all smartphones shipped in 2013 [2].

Until 2009 the default Linux kernel did not include support for performance counters (there were out-of-tree implementations available, but these were not included in most distributions and required custom kernel patching). The Linux 2.6.31 release introduced first-class performance counter support with the perf event subsystem. The perf event interface differs from previous Linux performance counter implementations, primarily in the amount of functionality handled by the operating system kernel. Many of the tasks perf event does in-kernel were previously done in userspace libraries, impacting measurement overhead.

There are three common ways of using performance counters: aggregate measurement, statistical sampling, and self-monitoring.

Aggregate measurement is the easiest and lowest-overhead method of gathering performance results. The counters are enabled just before a program starts running and total results are gathered when it finishes. The operating system aids measurement by saving and restoring values on context switch and also by handling overflows, where a value exceeds the counter width. This methodology provides exact total measurements, showing high-level overviews of program behavior. For detailed per-function or per-instruction results other methods must be used.

With statistical sampling the counters are programmed to periodically gather measurements at regular intervals via the generation of timer or overflow interrupts. By recording the instruction pointer at the time of sampling, function-granularity performance results can be extrapolated statistically. This has the advantage that the program does not have to be modified with calls to measurement routines, but has the disadvantage that the values are not exact and resolution can be coarse.
Setting the sampling rate higher can mitigate this, but if set too high it leads to increased interrupt and operating system overhead which can overwhelm the actual measurements. Not all hardware supports sampling; some systems lack a working overflow interrupt (for example various ARM systems 1), but this can be worked around in software.

This paper concentrates instead on self-monitoring usage of the counters, with an emphasis on the time overhead imposed by the operating system interface. With self-monitoring a program is instrumented to gather exact performance measurements for blocks of code. Calls to measurement routines are inserted into the code either by modifying the source code or by using some form of binary instrumentation. While statistical sampling can tell you, on average, which function in your program typically has the most of a metric (say L2-cache misses), with self-monitoring you can insert calipers around that function and find exact measurements.

The widely used cross-platform PAPI [3] performance measurement library uses self-monitoring to provide detailed measurements. When PAPI transitioned to perf event from the previous perfctr and perfmon2 interfaces, a sharp increase in overhead was found. At the time it was hoped that this was

1 The overflow interrupt is not available on Raspberry Pi hardware, and there are various errata that can cause overflow interrupts to be lost on Cortex A8 and Cortex A9 systems.
due to the newness of perf event and that the overhead would decrease over time as the code was improved. As seen in Figure 1 this was found not to be the case, which led to this investigation of why perf event overhead is high and what can be done to get results more in line with the previous perfctr and perfmon2 interfaces.

II. BACKGROUND

The hardware interface for performance counters involves accessing special CPU registers (in the x86 world these are known as Model Specific Registers, or MSRs). There are usually two types of registers: configuration registers (which allow starting and stopping the counters, choosing the events to monitor, and setting up overflow interrupts) and counting registers (which hold the current event counts). In general between 2 and 8 counting registers are available, although machines with many more are possible [4]. Reads and writes to the configuration registers usually require special privileged (ring 0 or supervisor) instructions; the operating system can allow access by translating user requests into the proper low-level CPU calls. Access to counting registers may require extra permissions; some processors provide special instructions that allow users to access the values directly (rdpmc on x86).

A typical operating system performance counter interface allows selecting events to monitor, starting and stopping counters, reading counter values, and (if the CPU supports notification on counter overflow) some mechanism for passing overflow information to the user.
Some operating systems provide additional features, such as event scheduling (handling limitations on which events can go into which counters), multiplexing (swapping events in and out and using time accounting to approximate the availability of more physical counters), per-thread counting (by loading and saving counter values at context switch time), process attaching (obtaining counts from an already running process), and per-cpu counting.

High-level tools provide common cross-architecture and cross-platform interfaces to performance counters. PAPI [3] is a widely-used performance counter abstraction interface that is in turn used by other tools (such as TAU [5], Vampir [6], or HPCToolkit [7]) that provide graphical frontends to the underlying performance counter infrastructure. Although users will typically work with these higher-level interfaces directly, performance analysis works best if backed by libraries and operating systems that can provide efficient low-overhead access to the counters.

A. Counter Implementations

Most modern processors have support for performance counters and thus many operating systems support interfaces to access them. Commercial UNIX systems [8], [9], [10] provided some of the earliest hardware performance counter implementations. Microsoft Windows does not include native support for hardware performance counters; third-party tools (such as Intel VTune [11], Intel VTune Amplifier, Intel PTU [12], and AMD's CodeAnalyst [13]) provide access by programming the CPU registers directly. Since counter state is not saved on context switch, only system-wide sampling is available. Apple OSX is similar to Windows in that only system-wide sampling is available by directly programming the hardware. A tool that does this is Shark [14]. FreeBSD [15] provides a more feature-rich interface that is similar to that available for Linux.
Patches providing Linux support appeared soon after the release of the original Pentium processor, the first x86 processor with counter hardware. These early implementations [16], [17], [18], [19], [20], [21] exported raw access to the underlying Model Specific Registers (MSRs) but also gradually added more features, such as saving values on context switch. Despite these many attempts at Linux performance counter interfaces, only three saw widespread use before the adoption of perf event: oprofile, perfctr, and perfmon2.

1) Oprofile: Oprofile [22] is a system-wide sampling profiler included in the Linux kernel since 2002. It allows sampling with arbitrary performance events (or a timer if you lack performance counters) and provides frequency graphs, profiles, and stack traces. The kernel interface is through a pseudo-filesystem under /dev/oprofile; profiling is started and stopped by writing there. Oprofile has limitations that make it unsuitable for general analysis: it is system-wide only, it requires starting a daemon as root, and it is a sampled interface so does not support self-monitoring. Oprofile is still supported, although it is considered deprecated (with perf event being the replacement).

2) Perfctr: Perfctr [23] is a widely-used performance counter interface introduced in 1999. The kernel interface involves opening a /dev/perfctr device and accessing it with various ioctl() calls. Fast counter reads (that do not require a slow system call into the kernel) are supported using the x86 rdpmc instruction in conjunction with mmap(). A libperfctr is provided which abstracts the kernel interface.

3) Perfmon2: Perfmon2 [24] is an extension of the Itanium-specific Perfmon. The perfmon2 interface adds a variety of system calls, with some additional system-wide configuration done via the /sys pseudo-filesystem. Abstract PMC (config) and PMD (data) structures provide a thin layer over the raw hardware counters. The original v2 interface involved 12 new system calls.
In response to concerns raised during code review this was reduced to just 4, but in the end the approach was abandoned by the Linux developers in preference for perf event. We use the last publicly released perfmon2 git tree in this work, a patched Linux kernel which uses version 2.9 of the interface with all 12 system calls. The libpfm3 library provides a high-level interface, providing event name tables and code that schedules events to avoid counter conflicts. These tasks are done in userspace (in contrast to perf event, which does this in the kernel).

B. perf event

The perf event subsystem [25] was created in 2009 as a response to the proposed merge of perfmon2. perf event entered Linux as "Performance Counters for Linux" and was subsequently renamed perf event in the 2.6.32 release. The underlying perf event design philosophy is to provide as much functionality and abstraction as possible in the kernel, making the interface straightforward for ordinary users. The interface is built around file descriptors allocated with the new perf_event_open() system call [26]. Events are specified at open time in an elaborate perf_event_attr
structure; this structure has over 40 different fields that interact in complex ways. Counters are enabled and disabled via calls to ioctl() or prctl(), and values are read via the standard read() system call. Sampling can be enabled to periodically read counters and write the results to a ring buffer which can be accessed via mmap(); new data availability can be determined via a signal or by poll().

Some events have elaborate hardware constraints and can only run in a certain subset of available counters. Perfmon2 and perfctr rely on libraries to provide event scheduling in userspace; perf event does this in the kernel. Scheduling is performance critical; a full scheduling algorithm can require O(N!) time (where N is the number of events). Heuristics are used to limit this, though this can lead to inefficiencies where valid event combinations are rejected as unschedulable. Experimentation with new schedulers is difficult with an in-kernel implementation, as this requires kernel modification and a reboot between tests rather than simply recompiling a userspace library.

Some users wish to measure more simultaneous events than the hardware can physically support. This requires multiplexing: events are repeatedly run for a short amount of time, then switched out with other events, and an estimated total count is statistically extrapolated [27], [28]. perf event multiplexes in the kernel (using a round-robin fixed interval), which can provide better performance [29] but less flexibility. More advanced sampling patterns and randomized intervals could provide lower error [30], [31], but perf event does not support such modes.

C. Post perf event Implementations

perf event's greatest advantage is that it is included in the Linux kernel; this is a huge barrier to entry for all competing implementations. Competitors must show compelling advantages before a user will bother taking the trouble to install something else.
Despite this, various new implementations have been proposed.

LIKWID [32] is a method of accessing performance counters on Linux that completely bypasses the Linux kernel by accessing MSRs directly. This can have low overhead but can conflict with concurrent use of perf event, as well as causing security issues (a regular user can obtain root privileges if given write access to arbitrary MSRs; see the related CVE). Only x86 processors are supported, and per-process measurements are not possible (due to the lack of MSR save on context switch). True self-monitoring is not possible, although markers can be added to user code that signal an external monitoring process to read the counters.

LiMiT [33] is an interface similar to the existing perfctr infrastructure. They find up to 23 times speedup versus perf event when instrumenting locks. Their methodology requires modifying the kernel and is x86 only.

III. EXPERIMENTAL SETUP

I compare the self-monitoring overhead of perf event against the perfctr and perfmon2 interfaces on various x86_64 machines as listed in Table I.

TABLE I: Machines used in this study.

  Processor                     Counters Available
  Intel Atom Cedarview D2550    2 general, 3 fixed
  Intel Core2 P8700             2 general, 3 fixed
  Intel IvyBridge i5-3210M      4 general, 3 fixed
  AMD Bobcat G-T56N             4 general

Performance measurements have the potential to disrupt the execution of programs; therefore it is necessary to keep instrumentation overhead as small as possible. Self-monitoring tools such as PAPI add two different chunks of code to an instrumented program. The first initializes the library and sets up the events to be measured. This initialization step can involve a considerable amount of code, but if placed early in the program execution it hopefully does not affect measurements that happen later. The second piece of code added is the actual instrumentation: the calipers conducting measurements around the area of interest.
This code starts the counters and then stops and reads values when finished (often the raw values are stored for later analysis, as any additional processing would add extraneous overhead). These operations are the ones that need to have the lowest possible overhead so as not to interfere with measurement.

We use the x86 rdtsc timestamp counter to obtain fine-grained timing of overhead. We measure the overhead of start, stop, and read on four x86_64 machines using perf event, perfctr, and perfmon2. An empty code block (a start followed by an immediate stop) is used to avoid any impact that arbitrary test code might have on measurement overhead. All three measurements are made in a single run by placing the timestamp instruction in between function calls, as shown in the following perf event example (the perfmon2 and perfctr code is similar):

    /* Events opened/initialized previously */
    start_before = rdtsc();
    ret1 = ioctl(fd[0], PERF_EVENT_IOC_ENABLE, 0);
    start_after = rdtsc();
    ret2 = ioctl(fd[0], PERF_EVENT_IOC_DISABLE, 0);
    stop_after = rdtsc();
    ret3 = read(fd[0], buffer, BUFFER_SIZE * sizeof(long long));
    read_after = rdtsc();

Direct comparisons of implementations are difficult; perf event support was not added until Linux 2.6.31, but perfmon2 development was halted and the most recent perfctr patches target older kernels. I test a full range of perf event kernels up through 3.18, and show cross-machine results using a recent kernel. For comparisons between the old and new interfaces, older hardware (such as Core2 systems) is needed, as the old interfaces do not support newer CPUs. Unless otherwise specified the kernels were compiled with gcc 4.4, with configurations chosen to be as identical as possible (the configuration files used are available from our website).
[Fig. 1 panels: (a) Read Overhead with 1 Event (core2); (b) Start Overhead with 1 Event (core2); (c) Stop Overhead with 1 Event (core2); (d) Overall Start/Stop/Read Overhead with 1 Event (core2); each plotted against Linux version.]

Fig. 1: Initial per-kernel self-monitoring measurement overhead comparison on core2. A core2 is used as perfctr and perfmon2 are not supported on more modern CPUs. The plots are boxplots: the solid box shows the 25th to 75th percentile with a white line at the median; the error bars show one standard deviation, and Xs mark outliers. Outliers are typically due to cache misses.
The /proc/sys/kernel/nmi_watchdog value is set to 0 on perf event to keep the kernel watchdog from stealing an available counter. We also disable dynamic frequency scaling during the experiments, as we found this affects the timing measurements on some systems.

IV. OVERHEAD COMPARISON

What follows is an investigation of the causes of self-monitoring overhead. I investigate total overhead, and then break out the read, start, and stop results separately.

A. Total Overhead

The overall plot in Figure 1d shows the initial total overhead results obtained when starting a pre-existing set of events, immediately stopping them (with no work done in between), and then reading the counter results. This gives a lower bound for how much overhead is involved when self-monitoring. In these tests only one counter event is being read. We run 1024 tests and create boxplots that show the 25th and 75th percentiles, the median, error bars indicating one standard deviation, and any additional outliers.

The perfmon2 kernel has the best results, followed by perfctr. In general all of the various perf event kernels perform worse, with a surprising amount of inter-kernel variation. The outliers in the graph (which affect the standard deviation) we find to be due to TLB and cache misses. The results shown were measured on an older core2 machine (as perfctr and perfmon2 do not support newer machines), but we find similar perf event results on more recent platforms such as Intel Atom, IvyBridge, and AMD Bobcat processors.

B. Read Overhead

When conducting measurements the most critical source of overhead comes from read operations. While start and stop are necessary, they can be pushed earlier or later (letting the counter run in a free-running mode). Reads, however, must happen in the area of interest, and often multiple reads are done and stored for later analysis.
A performance counter read involves reading a low-level processor register; these reads can be slow, often taking hundreds of cycles. Some processors provide a method of allowing users direct access to these registers (the rdpmc instruction on x86); if that is not available then a register read involves a privileged access and must go through the operating system's (slower) system call interface.

The read plot in Figure 1a shows the overhead of just reading a performance counter value on a core2 machine. perfctr has the lowest overhead by far, due to its use of the rdpmc instruction. perfmon2 is not far behind, but all of the perf event kernels lag. I ran many experiments to determine the causes of the high perf event overhead.

1) Compiler: I first look at the compiler as a possible source of overhead differences. In this paper I always use gcc 4.4 compiled kernels for consistency unless otherwise specified; this older compiler version is necessary because older kernels will not compile with newer gcc versions. Figure 2 shows the results found when varying the gcc version (4.4 through 4.8). The ensuing variation is small and thus not likely a factor in overhead.

Fig. 2: Overhead boxplot of perf event read on core2 while varying the gcc version (4.4, 4.5, 4.6, 4.7, 4.8) used to compile the Linux kernel. Compiler version does not majorly contribute to overhead.

2) Dynamic Frequency Scaling: At least on core2 systems, having Dynamic Frequency Scaling enabled would alter the results of my measurements. This is most likely due to the scaling affecting the rdtsc instruction used for timing measurements. To avoid this I disable frequency scaling on the test machines.

3) Dynamic vs Static Linking: I profiled the kernel at a low level in an attempt to find sources of overhead, only to find that much of the overhead was coming from userspace.
Unlike previous implementations, perf event does not use a custom system call to read the counters, but rather uses the stock read() call. This is implemented by the C library, which on Linux is typically dynamically linked. With such a binary the read() symbol is not resolved until run time at first use. This introduces the overhead of the dynamic link (which can be considerable) into our measurements. This is partly a limitation of the benchmark, but it is a potential real-world problem with actual performance measurements if the executable's first call to read() happens in the self-monitoring instrumentation code. To avoid this issue, one can statically link the executable (using the -static compiler option) or else hard-code the system call directly. The perfmon2 test does not see this overhead, as it uses a custom system call from a statically linked copy of the libpfm3 library. The perfctr test uses the rdpmc instruction directly and no function calls at all are used during a read. Figure 3 shows that methods using static linking (static and syscall-static) improve the overhead results versus dynamic linking, with results approaching the perfmon2 measurement values.

4) rdpmc Instruction: The x86 architecture supports a processor state flag that allows user programs direct, low-overhead access to the (usually restricted) performance counter registers. Programs can read the counters without having to go through the operating system.
[Fig. 3 panels: Read Overhead with 1 Event on core2, bobcat, ivybridge, and atom-cedarview, comparing the default read against the syscall, static, rdpmc, and rdpmc-touch variants (and perfmon2/perfctr where supported).]

Fig. 3: Read results on various machines showing the overhead improvements gained using the methods described in this paper. Perfmon2 and perfctr support is not available for newer architectures.

[Fig. 4 panel: Read Overhead with 1 Event (core2) versus Linux version, with rdpmc results overlaid on regular read results.]

Fig. 4: Initial self-monitoring read overhead on perf event while using rdpmc. The rdpmc interface was introduced in Linux 3.4 and is shown in dark green, overlaying regular read results shown in light grey. The rdpmc results were expected to be much lower, but were not, due to page faults in the mmap() interface.
The perfctr interface explicitly uses rdpmc for counter reads. Support for this was added to perf event with version 3.4 of the Linux kernel. To use rdpmc on perf event, a helper kernel page must first be mapped into the process address space via mmap(). This mmap page contains information such as which raw counter to read (counters can be moved around due to multiplexing support) as well as how long the event has been scheduled (also for multiplexing support). After the mmap page is consulted, the proper counter is read via rdpmc. A sequence number is provided in the mmap page to allow verification that no overflows or other changes have happened that would invalidate the result (if so, the read must be retried).

Our initial experiments found (to our great surprise) that the perf event rdpmc code was no better, and in some cases worse, than the overhead of using the read system call. This can be seen in Figure 4. After some investigation it was found that the overhead happens because the first access to the mmap'd kernel page causes a page fault. Handling the page fault and mapping the data into the program's address space can take thousands of cycles. perfctr avoids this overhead (despite also using a mmap page) by specifically inserting the page directly into the process address space at initialization time, so no page fault is necessary. The page fault behavior can be mitigated by either using the MAP_POPULATE option to mmap() when creating the page, or else by explicitly touching the page to bring it in at creation time (rather than at first read time). Figure 3 shows the improved performance when using these methods (rdpmc-populate and rdpmc-touch). These methods entail much lower overhead, on par with perfctr, despite needing two rdpmc calls for perf event 2.

5) Reading Multiple Times: The results presented so far have shown read overhead when doing a single counter read.
Often a user will read multiple times, such as reading upon each iteration of a loop. This spreads out the startup cost, and in general each subsequent read should mitigate the initial high overhead. Figure 5 shows this effect happening when using rdpmc, but the behavior when using default read() based measurements is more complex. Upon further investigation it turns out that the subsequent read operations cause L1 Data Cache misses which add extra overhead.

C. Start Overhead

As can be seen in the start graph in Figure 1b, the overhead of start on perf event could be better. Events are started with an ioctl() call, which can trigger a lot of kernel work. perfmon2 has impressively fast start code; costly setup (such as event scheduling) is done in userspace and can be done in advance. In contrast, perf event's expensive in-kernel event scheduling happens at start time (instead of at event creation). It is surprising that the perfctr results are not better, but this could be due to its unusually complex interface involving complicated ioctl arguments for start and stop.

2 The perf event interface does not zero the counter at start time, so to calculate the count rdpmc must be run before and after the instrumented code and the difference calculated.

Figure 6 shows the start overhead behavior on various x86 systems when optimized in the same way that the reads were. Using a statically linked binary helps with overhead (due to dynamic linking overhead, as seen with read), but perfmon2 start overhead is still much lower.

D. Stop Overhead

The stop graph in Figure 1c shows the overhead of stopping a counter on core2. The differences between implementations are not as pronounced; the kernel does little besides telling the CPU to stop counting. Still, perfctr and perfmon2 have less overhead. This is not affected much by the overhead reduction methodology, as shown in Figure 7.
Static linkage does not make as much difference because in our test the start ioctl() call pays the price for dynamic linking and the code is already set up when the stop ioctl() happens.

E. Varying Number of Events

The previous graphs look at overhead when measuring one event at a time; Figure 8 shows the variation on Core2 as we measure more than one event. On most implementations the overhead increases linearly with more events, as the number of MSR reads increases with the event count. perfctr does not grow as fast because it has the ability to read the counters for multiple events using only one memory page, while perf event requires accessing a different mmap page for each event.

V. RELATED WORK

Previous performance counter investigations concentrate either on the underlying hardware designs or on high-level userspace tools; my work focuses on the often overlooked intermediate operating system interface.

Mytkowicz et al. [34] explore measurement bias and sources of inaccuracy in architectural performance measurement. They use PAPI on top of perfmon and perfctr to investigate performance measurement differences while varying the compiler options and link order of benchmarks. They measure at a high level using PAPI and do not investigate sources of operating system variation. Their work predates the introduction of the perf event interface.

Zaparanuks et al. [35] study the accuracy of perfctr, perfmon2, and PAPI on Pentium D, Core 2 Duo, and AMD Athlon 64 X2 processors. They measure overhead using libpfm and libperfctr directly, as well as the low- and high-level PAPI interfaces. They find measurement error is similar across machines. Their work primarily focuses on counter accuracy and variation rather than overhead (though these effects can be related). They did not investigate the perf event subsystem as it was not available at the time.

DeRose et al.
[36] investigate performance counter variation and error on a Power3 system with regard to startup and shutdown costs. They do not discuss the underlying operating system interface.

Maxwell et al. [37], Moore et al. [38] and Salayandia [39] investigate PAPI overhead, but they do not explore operating system differences. Their work predates PAPI support for perfmon2 or perf event.
[Fig. 5 panels: (a) core2 overhead of successive counter reads (multiple read overhead, including rdpmc-touch-static and an L1 D-cache miss marker); (b) bottom of graph zoomed to show rdpmc detail.]

Fig. 5: core2 perf event read overhead showing the cost of repeated calls. One would expect the overhead to drop off; we find that due to L1 cache misses the behavior is more complex.

[Fig. 6 panels: Start Overhead with 1 Event on core2, bobcat, ivybridge, and atom-cedarview, comparing the syscall, static, rdpmc, and rdpmc-touch variants against perfmon2 and perfctr where supported.]

Fig. 6: Time overhead of starting counters on various machines when using different overhead reduction methods.
[Fig. 7 panels: Stop Overhead with 1 Event on core2, bobcat, ivybridge, and atom-cedarview, comparing the syscall, static, and rdpmc-touch variants against perfmon2 and perfctr where supported.]

Fig. 7: Time overhead of stopping counters on various machines when using different overhead reduction methods.

[Fig. 8: Average time (cycles) of start, stop, and read on core2 versus the number of simultaneous events being measured, for the rdpmc, Perfmon2, Perfctr, static, and static,touch configurations.]

Fig. 8: Overall overhead on core2 while varying the number of events being measured.

Zhang et al. [40] propose providing high-performance counters directly to the operating system for use in scheduling and other tasks. In this case the counters are not exposed to the user, limiting usefulness in analysis or optimization.

VI. CONCLUSION AND FUTURE WORK

Performance counters are a key resource for finding and avoiding system bottlenecks. Modern high performance architectures are complex enough that it is hard to create theoretical models of performance; analysis is even more difficult when using parallel code bases. The easiest way to gather detailed actual performance data is via hardware performance counters. Low-overhead access to these counters is critical when providing detailed performance measurements.

I investigate in detail the self-monitoring overhead of the Linux perf event interface. I find that straightforward implementations of the interface (as used by PAPI) can have much larger overhead than the previous perfmon2 and perfctr interfaces. I show that with some minimal changes the perf event read overhead can be reduced to values that are competitive with the previous interfaces; this involves using rdpmc for
10 Paper Appears in ISPASS 215, IEEE Copyright Rules Apply counter reads, pre-faulting in the mmap kernel page, and using statically linked libraries. I plan to use this knowledge to suggest improvements to both user tools such as PAPI as well as the Linux kernel so that the benefits of low-overhead measurements can be obtained without extra user intervention. In addition I would like to investigate overhead on other architectures such as Power and ARM. Modern processors have underutilized advanced counter implementations; the overhead of other interfaces (such as AMD Lightweight Profiling) and other PMUs (such as Uncore and Northbridge events) remain to be investigated. All code and data used in this paper can be found at the following website: vweaver/projects/ perf events/overhead/ REFERENCES [1] Top5, Top 5 supercomputing sites, operating system family, [2] IDC, Android pushes past 8% market share, [3] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, A portable programming interface for performance evaluation on modern processors, International Journal of High Performance Computing Applications, vol. 14, no. 3, pp , 2. [4] V. Salapura, K. Ganesan, A. Gara, M. Gscwind, J. Sexton, and R. Walkup, Next-generation performance counters: Towards monitoring over thousand concurrent events, in Proc. IEEE International Symposium on Performance Analysis of Systems and Software, Apr. 28, pp [5] S. Shende and A. Malony, The Tau parallel performance system, International Journal of High Performance Computing Applications, vol. 2, no. 2, pp , 26. [6] W. Nagel, A. Arnold, M. Weber, H.-C. Hoppe, and K. Solchenbach, VAMPIR: Visualization and analysis of MPI resources, Supercomputer, vol. 12, no. 1, pp. 69 8, [7] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor- Crummey, and N. Tallent, HPCToolkit: Tools for performance analysis of optimized parallel programs, Concurrency and Computation: Practice and Experience, vol. 22, no. 6, pp , 21. [8] M. Zagha, B. 
Larson, S. Turner, and M. Itzkowitz, Performance Analysis Using the MIPS R1 Performance Counters, Silicon Graphics Inc., [9] J. Dean, J. Hicks, C. Waldspurger, W. Weihl, and G. Chrysos, ProfileMe: Hardware support for instruction-level profiling on out-of-order processors, in Proc. IEEE/ACM 3th International Symposium on Microarchitecture, Dec. 1997, pp [1] J. Anderson, L. Berc, J. Dean, S. Ghemawat, M. Henzinger, S. Leung, L. Sites, M. Vandevoorde, A. Waldspurger, and W. Weihl, Continuous profiling: Where have all the cycles gone? in Proc. 16th ACM Symposium on Operating Systems Principles, Oct. 1997, pp [11] J. Wolf, Programming Methods for the Pentium TM III Processor s Streaming SIMD Extensions Using the VTune TM Performance Enhancement Environment, Intel Corporation, [12] Intel, Intel TM Performance Tuning Utility, [13] P. Drongowski, An introduction to analysis and optimization with AMD CodeAnalyst TM Performance Analyzer, Advanced Micro Devices, Inc., 28. [14] Apple Developer Connection, Optimizing with shark: Big payoff, small effort, optimize.html, 24. [15] J. Koshy, PmcTools, [16] D. Mentrè, Hardware counter per process support, [17] M. Goda and M. Warren, Linux performance counters pperf and libpperf, [18] R. Berrendorf and H. Ziegler, PCL the performance counter library: A common interface to access hardware performance counters on microprocessors, Forschungzentrum Jülich GmbH, Tech. Rep. FZJ- ZAM-IB-9816, [19] H. Hoyer, p5 performance counter MSR driver, [2] E. Hendriks, PERF.7, [21] D. Heller, Rabbit: A performance counters library for Intel/AMD processors and Linux, [22] J. Levon, Oprofile, [23] M. Pettersson, The perfctr interface, mikpe/linux/ perfctr/2.6/, [24] S. Eranian, Perfmon2: a flexible performance monitoring interface for Linux, in Proc. 26 Ottawa Linux Symposium, Jul. 26, pp [25] T. Gleixner and I. Molnar, Performance counters for Linux, 29. [26] V. Weaver, perf event open manual page, in Linux Programmer s Manual, M. 
Kerrisk, Ed., Dec [27] W. Mathur and J. Cook, Improved estimation for software multiplexing of performance counting, in Proc. 13th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Sep. 25, pp [28] W. Mathur, Improving accuracy for software multiplexing of on-chip performance counters, Master s thesis, New Mexico State University, May 24. [29] R. Azimi, M. Stumm, and R. Wisniewski, Online performance analysis by statistical sampling of microprocessor performance counters, in Proc. 19th ACM International Conference on Supercomputing, 25. [3] T. Mytkowicz, P. Sweeney, M. Hauswirth, and A. Diwan, Time interpolation: So many metrics, so few registers, in Proc. IEEE/ACM 41st Annual International Symposium on Microarchitecture, 27, pp [31] M. Casas, R. Badia, and J. Labarta, Multiplexing hardware counters by spectral analysis, in Para 21 - State of the Art in Scientific and Parallel Computing, Jun. 21. [32] J. Treibig, G. Hager, and G. Wellein, LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments, in Proc. of the First International Workshop on Parallel Software Tools and Tool Infrastructures, Sep. 21. [33] J. Demme and S. Sethumadhavan, Rapid identification of architectural bottlenecks via precise event counting, in Proc. 38th IEEE/ACM International Symposium on Computer Architecture, Jun [34] T. Mytkowicz, A. Diwan, M. Hauswirth, and P. Sweeney, Producing wrong data without doing anything obviously wrong! in Proc. 14th ACM Symposium on Architectural Support for Programming Languages and Operating Systems, Mar. 29. [35] D. Zaparanuks, M. Jovic, and M. Hauswirth, Accuracy of performance counter measurements, in Proc. IEEE International Symposium on Performance Analysis of Systems and Software, Apr. 29, pp [36] L. DeRose, The hardware performance monitor toolkit, in Proc. 7th International Euro-Par Conference, Aug. 21, pp [37] M. Maxwell, P. Teller, L. Salayandia, and S. 
Moore, Accuracy of performance monitoring hardware, in Proc. Los Alamos Computer Science Institute Symposium, Oct. 22. [38] S. Moore, D. Terpstra, K. London, P. Mucci, P. Teller, L. Salayandia, A. Bayona, and M. Nieto, PAPI deployment, evaluation, and extensions, in Proc. of User Group Conference, Jun. 23. [39] L. Salayandia, A study of the validity and utility of PAPI performance counter data, Master s thesis, The University of Texas at El Paso, Dec. 22. [4] X. Zhang, S. D. amd G. Folkmanis, and K. Shen, Processor hardware counter statistics as a first-class system resource, in Proc. of the 11th USENIX workshop on Hot topics in operating systems, 27, pp