<Insert Picture Here> Tracing on Linux: the Old, the New, and the Ugly Elena Zannoni (elena.zannoni@oracle.com) Linux Engineering, Oracle America October 27 2011
Categories of Tracing Tools Kernel Tracing Userspace Applications Tracing Static Tracing Dynamic Tracing Sampling Continuous Integrated with kernel Standalone 2
A Brief History LTT (Linux Trace Toolkit) (1998) replaced in 2006 Kprobes (2004): http://www.ibm.com/developerworks/library/l-kprobes/index.html Dtrace for Solaris (2003) released on Solaris 10 in November 2005 Systemtap for Linux (2005) at OLS LTTng (LTT Next Generation) (2006): https://lkml.org/lkml/2006/1/9/223 Ftrace: (2008): https://lkml.org/lkml/2008/1/3/26 Perf: (2008): https://lkml.org/lkml/2008/12/4/401 3
Currently Active Projects Ftrace Perf Systemtap LTTng 4
Ftrace In kernel with portions outside Main Maintainer: Steven Rostedt (2008) CLI: /sys/kernel/debug/tracing as interface (control and output) GUI for data visualization (KernelShark) http://people.redhat.com/srostedt/kernelshark/html/ Trace-cmd: user space tool with subcommands http://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git Fundamental mechanism: gprof hooks. Adds mcount() call at entry of each kernel function. Requires kernel to be compiled with -pg option for some functionality User: Root privileges No user space tracing Documentation: in kernel tree Documentation/ftrace.txt and Documentation/ftrace_design.txt See: http://www.lwn.net/articles/365835/ & http://www.lwn.net/articles/366796/ 5
Information collected: ftrace plugins Function: trace entry of all kernel functions Function-graph: traces on both entry and exit of the functions. It then provides the ability to draw a graph of function calls Wakeup: max latency that it takes for the highest priority task to get scheduled after it has been woken up Irqsoff: areas that disable interrupts and saves the trace with the longest max latency Preemptoff: amount of time for which preemption is disabled Preemptirqsoff: largest time for which irqs and/or preemption is disabled Nop: trace nothing Uses /sys/kernel/debug/tracing/ 6
Ftrace: trace-cmd User space tool Requires root Some commands are: Record: start collection of events into file trace.dat Can use ftrace plugins (-p) or static kernel events (-e) Report: display content of trace.dat Start: start tracing but keep data in kernel ring buffer Stop: stop collecting data into the kernel ring buffer Extract: extract data from ring buffer into trace.dat List: list available plugins (-p) or events (-e) Many options, very flexible http://lwn.net/articles/410200/ & http://lwn.net/articles/341902/ 7
Ftrace: Original Interface # echo sched_switch > current_tracer # echo 1 > tracing_on # sleep 1 # echo 0 > tracing_on # cat trace # tracer: sched_switch # TASK PID CPU# TIMESTAMP FUNCTION # bash 3997 [01] 240.132281: 3997:120:R + 4055:120:R bash 3997 [01] 240.132284: 3997:120:R ==> 4055:120:R sleep 4055 [01] 240.132371: 4055:120:S ==> 3997:120:R bash 3997 [01] 240.132454: 3997:120:R + 4055:120:S bash 3997 [01] 240.132457: 3997:120:R ==> 4055:120:R sleep 4055 [01] 240.132460: 4055:120:D ==> 3997:120:R 8
Ftrace: Trace-cmd # trace cmd record e sched_switch sleep 5 # trace cmd report version = 6 cpus=2 sleep 29568 [000] 2930217.458882:sched_switch:trace cmd:29568 [120] R ==> evince:28664 [120] evince 28664 [000] 2930217.458977:sched_switch:evince:28664 [120] S ==> trace cmd:29568 [120] sleep 29568 [000] 2930217.459349:sched_switch:sleep:29568 [120] R ==> pulseaudio:1909 [120] pulseaudio 1909 [000] 2930217.459516:sched_switch:pulseaudio:1909 [120] S ==> sleep:29568 [120] sleep 29568 [000] 2930217.460463:sched_switch: sleep:29568 [120] S ==> trace cmd:29566 [120] trace cmd 29566 [000] 2930217.460510:sched_switch:trace cmd:29566 [120] S ==> trace cmd:29567 [120] trace cmd 29567 [000] 2930217.460543:sched_switch:trace cmd:29567 [120] S ==> swapper:0 [120] <idle> 0 [000] 2930217.460710:sched_switch:swapper:0 [120] R ==> Xorg:1675 [120] 9
10 Ftrace GUI: kernelshark
Ftrace: Dynamic Tracing Added by Masami Hiramatsu (end of 2009) Real dynamic tracing Use /sys/kernel/debug/tracing/kprobe_events to control from command line Flexible to an extent, but hard to use because of syntax Read more: Documentation/trace/kprobetrace.txt LWN article: http://lwn.net/articles/343766/ echo 'r:myretprobe do_sys_open $retval' > /sys/kernel/debug/tracing/kprobe_events echo > /sys/kernel/debug/tracing/kprobe_events 11
Perf In kernel (tools/perf directory) userspace tool (started in 2008) Main contributors : Thomas Gleixner, Ingo Molnar, Frederic Weisebecker, Arnaldo Charvalo De Melo, Peter Zijlstra and others Started as Hardware performance counters interface, initially called perf counters. Originally on top of the hardware-based performance monitoring interface for Linux by Stephane Eranian. Now available as libpfm version 4 http://perfmon2.sourceforge.net/docs_v4.html Has grown into all encompassing tracing system Very active at the moment No user space tracing yet Documentation: tools/perf/design.txt and tools/perf/documenation Perf wiki: https://perf.wiki.kernel.org/index.php/main_page 12
Perf subcommands Perf stat: collects and display events data during a command execution Perf record: run a command, store its profiling (sampling mode) in output file (perf.data) Perf report: display data previously recorded in output file (perf.data) Perf diff: diff between two perf.data files Perf top: performance counter profile in real time Perf probe: define dynamic tracepoints (more on this later) Perf kmem: show statistics on kernel mem (from perf.data), or record kernel mem usage for a command Perf trace-perl (trace-python): process previously recorded trace data using a perl script (python script) 13
Perf subcommands (continued) Perf list: list available symbolic events types (in exadecimal encoding) Perf annotate: display perf.data with assembly or source code of the command executed Perf lock: lock statistics Perf sched: scheduler behaviour Perf kvm: perform same as above but on a kvm guest environment with subcommands: Perf kvm top Perf kvm record Perf kvm report Perf kvm diff 14
Perf: Types of events Perf list shows the list of predefined events that are available Use one or more of the events in perf stat commands: perf stat -e instructions sleep 5 Hardware: cpu cycles, instructions, cache references, cache misses, branch instructions, branch misses, bus cycles. Uses libpfm. CPU model specific. Hardware cache: for instance L1-dcache-loads, L1-icache-prefetches Hardware breakpoint Raw: hexadecimal encoding provided by harware vendor Software: in kernel events. Such as: cpu clock, task clock, page faults, context switches, cpu migrations, page faults, alignment faults, emulation faults. Static tracepoints: needs the ftrace event tracer enabled. Events are listed in /sys/kernel/debug/tracing/events/* (locations defined with TRACE_EVENT() macros) Dynamic tracepoints (if any are defined) See include/linux/perf_event.h 15
Other Control Parameters Process specific: default mode, perf follows fork() and pthread_create() calls Thread specific: use - - no-inherit (or -i) option System wide: use -a option CPU specific: use -C with list of CPU numbers For running process: -p <pid> To find what binaries were running, from a perf.data file, use Elf unique identifier inserted by linker: perf buildid-list User, Kernel, Hypervisor modes: count events when the CPU is in user, kernel and/or hypervisor mode only. Use with perf stat command. 16
Dynamic Tracing in Perf Contributed by Masami Hiramatsu Probes can be inserted on the fly using perf probe Defined on the command line only Syntax can be simple or complex Can use dwarf debuginfo to show source code (--line option) before setting probe (uses elfutils libdw like systemtap), or to use function name, line number, variable name Abstraction on ftrace kprobes, alleviates usage problems, supports also same syntax to specify probes Some options: --add, --del, --line, --list, --dry-run, --verbose, --force Read more: tools/perf/documentation/perf-probe.txt 17
Example Add probe, run a command, and display the data: #./perf probe -P 'p:myprobe @fs/read_write.c:285 file buf count' Adding new event: p:myprobe vfs_read+57 file=%bx buf=%si count=%ax #./perf list... rnnn [raw hardware event descriptor] kprobes:myprobe [Tracepoint event] #./perf record -f -e kprobes:myprobe:record -F 1 -a ls #./perf trace [...] perf-11445 [000] 95862.048894383: myprobe: (c04bbed5) file=dae15e80 buf=b78b2000 count=400 perf-11445 [000] 95862.049066533: myprobe: (c04bbed5) file=dae15d80 buf=b78b2000 count=400 [...] 18
LTTng External to kernel tree Maintainer Mathieu Desnoyers http://lttng.org/ Userspace tracing via markers and tracepoints (static instrumentation). Need to link with special library. kprobes/tracepoints support for kernel tracing GUI to view events (LTTV) Eclipse integration Included in some embedded Linux distributions Just released Version 2.0 (July 2011) 19
Systemtap External to kernel tree (started in 2005) Multiple maintainers: Red Hat, IBM, others http://sourceware.org/systemtap/wiki Kernel tracing using kprobes Userspace tracing requires in kernel infrastructure that is not in mainline (only Red Hat based kernels) (utrace/uprobes) Dynamic probes and tracepoints Has well defined rich scripting language GUI available Safety issues around modifying memory locations (guru mode) Can use debuginfo (seen as drawback) Uses gcc (seen as drawback) Remote operation allowed 20
Kprobes: Dynamic Tracing Added to the Linux kernel in 2005 Similar to breakpoints: place breakpoint instruction at desired code location and execute actions when hit. Optimizations to kprobes using Jumps instead of exceptions. Used by Systemtap first, now also by ftrace and perf 21
Events Markers Static probe points in kernel code Independent of users (ftrace, perf, Lttng, systemap) Many uses in the kernel (a few hundreds and growing) TRACE_EVENT() macro with 6 arguments Definitions in include/linux/tracepoint.h Define characteristics of tracing actions (probe) using TRACE_EVENT() in include/trace/events/*.h Mark tracing locations with function call trace_my_event_name() e.g. trace_sched_switch() in sched.c and TRACE_EVENT(sched_switch,...) defined in sched.h Read the 3-article series: http://lwn.net/articles/379903/ http://lwn.net/articles/381064/ http://lwn.net/articles/383362/ 22
Utrace Systemtap provides userspace tracing based on utrace. Code in RHEL/Fedora. Utrace: goal was to be replacement for ptrace. Status: development abandoned. Fedora still ships it, used by a few other tools. See mailing list: https://www.redhat.com/mailman/listinfo/utrace-devel See code: http://redhat.com/~roland/utrace/ Maintained by Oleg Nesterov Patch kept up to date with recent kernels, most recent is for kernel 3.1 (August 2011) 23
Uprobes Uprobes: handle userspace breakpoints in kernel Analogous to kprobes Uses breakpoint instruction Posted many revisions No longer based on utrace No signals Multiple tracers allowed Ptrace replacement eventually? Still not approved Latest version of the patchset https://lkml.org/lkml/2011/9/20/123 Integration of previous version of the patch with latest kernel F15 and F16 (from Josh Stone) http://permalink.gmane.org/gmane.linux.systemtap/18118 24
Tracing Infrastructure Unification? Goal: unify the kernel infrastructure for tracing There are separate ring buffers used by perf and ftrace Need more complex interfaces Need to be fast Simpler ring buffers in the kernel (kfifo) for other users also exist Unification effort restarted many times: http://lkml.org/lkml/2010/7/9/368 http://lkml.org/lkml/2010/8/17/485 Some old benchmarks: http://git.kernel.org/pub/scm/linux/kernel/git/compudj/linux-2.6-ringbuffer.git Status quo with multiple buffers is likely to persist. Too many differing needs. Possible unification of ftrace and perf? Proposal of trace but stalled: https://lkml.org/lkml/2010/11/16/466 http://www.linuxplumbersconf.org/2011/ocw/proposals/609 25
Dtrace A Solaris tool, available since 2005 Ported to Linux by Oracle, initial release October 2011, version 0.1, still WIP Code posted here http://oss.oracle.com/git/ linux-2.6-dtrace-modules-beta.git linux-2.6-dtrace-unbreakable-beta.git Dtrace provider, syscall provider Next: static probes, then tackle userspace tracing Integrated with Oracle Unbreakable Enterprise Kernel, available on UEK 2.6.32 right now (0.1 version), moving to UEK2 version 2.6.39 from now on x86_64 only Kernel changes are GPL Kernel Module is CDDL Unrelated to other Linux porting work by Paul Fox 26
In Summary Systemtap: Moving forward, but on its own. Provides userspace tracing. Resistance by kernel community around architectural design decisions. Integration as is with upstream kernel unlikely? LTTng: Developed actively, but standalone. Provides userspace tracing. Code in its own repo. Not in mainline kernel. Ftrace: continuing, thanks to trace-cmd and kernelshark. Kernel focused Perf: most active project at this time. Kernel focused Unification of Buffers in kernel: stalled Scalability: in progress. See google's comments: http://lwn.net/articles/464268/ Utrace: development stalled/abandoned, maintenance only Uprobes: hopeful In kernel interpreter: maybe? tracepoints/static events: KABI status? Integration of ftrace and perf? trace? Common Trace Format: proposed by Mathieu Desnoyers, but not accepted by Linux community. Evolving on its own with LTTng-2.0 Overall area: kind of stalled. What happened at KS 2011? Nothing much. 27
Thank You Ananth Mavinakayanahalli (IBM) Srikar Dronamraju (IBM) Masami Hiramatsu (Hitachi) Steven Rostedt (Red Hat) Josh Stone and Frank Eigler (Red Hat) 28
29 The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle s products remains at the sole discretion of Oracle.