Huge thanks to our Platinum Members Endace and LiveAction,
and our Silver Member Veeam, for supporting the Wireshark Foundation and project.

Wireshark-dev: [Wireshark-dev] [RFC] Adding plugin for Linux Ftrace

From: Steven Rostedt <rostedt@xxxxxxxxxxx>
Date: Wed, 25 Nov 2009 15:31:26 -0500
Hi all,

I'm the Linux Kernel Ftrace maintainer. If you do not know what ftrace
is, you can take a look at:

http://people.redhat.com/srostedt/ftrace-tutorial-linux-con-2009.odp

or this:

http://lwn.net/Articles/290277/


Anyway, ftrace is a nice tool that does internal tracing of the kernel.
It features a full function tracer, latency tracers and event tracing.
Events are being added quite frequently by other kernel maintainers so
that they can see what is happening in their system and fine tune it.

ftrace requires no user space tool outside of cat and echo. Currently
the ftrace files exist inside the debugfs file system.

# mkdir /debug
# mount -t debugfs nodev /debug
# ls -l /debug/tracing
available_events            per_cpu             stack_trace
available_filter_functions  printk_formats      trace
available_tracers           README              trace_clock
buffer_size_kb              saved_cmdlines      trace_marker
current_tracer              set_event           trace_options
dyn_ftrace_total_info       set_ftrace_filter   trace_pipe
events                      set_ftrace_notrace  trace_stat
failures                    set_ftrace_pid      tracing_cpumask
function_profile_enabled    set_graph_function  tracing_enabled
options                     stack_max_size      tracing_on


For scheduling events, I only need to do:

 # echo 1 > /debug/tracing/events/sched/enable
 # cat /debug/tracing/trace
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
            bash-6594  [002] 53067.166361: sched_wakeup: task bash:6594 [140] success=0 [002]
            bash-6594  [002] 53067.166364: sched_stat_runtime: task: bash:6594 runtime: 730858 [ns], vruntime: 669100335850 [ns]
            bash-6594  [002] 53067.166396: sched_wakeup: task bash:6594 [140] success=0 [002]
            bash-6594  [002] 53067.166397: sched_stat_runtime: task: bash:6594 runtime: 37564 [ns], vruntime: 669100373414 [ns]
            bash-6594  [002] 53067.166541: sched_wakeup: task bash:6594 [140] success=0 [002]
            bash-6594  [002] 53067.166542: sched_stat_runtime: task: bash:6594 runtime: 145028 [ns], vruntime: 669100518442 [ns]

For function tracing:

 # echo function > /debug/tracing/current_tracer
 # cat /debug/tracing/trace
# tracer: function
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
          <idle>-0     [002] 53130.839303: smp_apic_timer_interrupt <-apic_timer_interrupt
          <idle>-0     [002] 53130.839304: native_apic_mem_write <-smp_apic_timer_interrupt
          <idle>-0     [002] 53130.839305: exit_idle <-smp_apic_timer_interrupt
          <idle>-0     [002] 53130.839305: __rcu_read_lock <-__atomic_notifier_call_chain
          <idle>-0     [002] 53130.839306: __rcu_read_unlock <-__atomic_notifier_call_chain
          <idle>-0     [002] 53130.839306: irq_enter <-smp_apic_timer_interrupt


And for function graph tracing:

 # echo function_graph > /debug/tracing/current_tracer
 # cat /debug/tracing/trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 3)   ==========> |
 3)               |  smp_apic_timer_interrupt() {
 3)   1.354 us    |    native_apic_mem_write();
 3)   1.326 us    |    exit_idle();
 3)               |    irq_enter() {
 3)   1.274 us    |      rcu_irq_enter();
 3)   1.013 us    |      idle_cpu();
 3)   5.449 us    |    }
 3)               |    hrtimer_interrupt() {
 3)   0.968 us    |      ktime_get();
 3)   1.909 us    |      _spin_lock();
 3)               |      __run_hrtimer() {
 3)   1.051 us    |        __remove_hrtimer();
 3)   1.199 us    |        _spin_unlock();
 3)               |        tick_sched_timer() {
 3)   0.971 us    |          ktime_get();
 3)               |          update_process_times() {


There's lots of options and other fun things that can be done with this
tool.

Ftrace also comes with a binary interface in the

 /debug/tracing/per_cpu/cpu0/trace_pipe_raw

(s/cpu0/cpu1/ for cpu 1, etc)

To read this, a tool is required since the raw output is not something
nice that cat can work with. For this, I wrote "trace-cmd"

 git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git

This is a simple command line tool that reads the trace_pipe_raw files
using splice (very fast kernel to kernel transfer. That is, copying from
the ftrace ring buffer to a file without ever going through user space).

With this tool I can do:

  # trace-cmd record -e all ls -ltr > /dev/null
  # trace-cmd report
version = 0.5
cpus=4
           <...>-6677  [003] 53508.185622: lock_release: &journal->j_state_lock
       trace-cmd-6679  [002] 53508.185623: sys_enter: NR 275 (15, 0, 17, 0, 1000, 1)
           <...>-6680  [000] 53508.185623: sys_enter: NR 275 (15, 0, 17, 0, 1000, 1)
       trace-cmd-6676  [001] 53508.185623: kfree: call_site=ffffffff81092320 ptr=(nil)
       trace-cmd-6676  [001] 53508.185627: sys_exit: NR 1 = 1
           <...>-6677  [003] 53508.185628: lock_release: jbd_handle
       trace-cmd-6679  [002] 53508.185629: kmem_cache_alloc: call_site=ffffffff810a5f58 ptr=0xffff8800270e5d80 bytes_req=32 bytes_alloc=32 gfp_flags=GFP_KERNEL|GFP_ZERO
           <...>-6680  [000] 53508.185629: kmem_cache_alloc: call_site=ffffffff810a5f58 ptr=0xffff88003ea36440 bytes_req=32 bytes_alloc=32 gfp_flags=GFP_KERNEL|GFP_ZERO
       trace-cmd-6676  [001] 53508.185630: sys_enter: NR 3 (14, 7f16140f0000, 1, 22, ffffffff, 0)
       trace-cmd-6679  [002] 53508.185630: kmalloc: call_site=ffffffff810a5f47 ptr=0xffff8800270e5d80 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL|GFP_ZERO
           <...>-6677  [003] 53508.185630: kmem_cache_free: call_site=ffffffffa002efb5 ptr=0xffff88003d901270


This gets more involved with the options trace-cmd can do.

I've been asked several times when there would be a GUI interface, and I
even started to relearn GTK (use to use it a lot in 1999). I was even
going to model it after wireshark.

Finally, it was suggested to me (by Johannes Berg) that we should make a
wireshark plugin that would read all this information. I thought
 "Why not?"

Hence, this email.

I've used wireshark a lot to analyze my home network, but never got
involved with its development. I would like to make a plugin for it that
would allow wireshark to be a GUI front end for ftrace. I think it makes
a very good match. We could use it to enable tracing, just like a start
capture on an interface would do. Then it could display the output. We
could filter the trace. Like networking, ftrace allows for both live
filters and trace-cmd has some limited filtering ability on the saved
data (which would be more trivial to add more).

Johannes has told me that wireshark currently only reads pcap file
formats. So what would be the preferred method to enable ftrace here?
Having it converted to a pcap file format? This is actually quite more
difficult than it sounds. Since the data of each record can change
between kernel to kernel, the formats are exported by the kernel in
the /debug/tracing/events/*/*/format file.

  # cat /debug/tracing/events/sched/sched_switch/format
name: sched_switch
ID: 33
format:
	field:unsigned short common_type;	offset:0;	size:2;
	field:unsigned char common_flags;	offset:2;	size:1;
	field:unsigned char common_preempt_count;	offset:3;	size:1;
	field:int common_pid;	offset:4;	size:4;
	field:int common_lock_depth;	offset:8;	size:4;

	field:char prev_comm[TASK_COMM_LEN];	offset:12;	size:16;
	field:pid_t prev_pid;	offset:28;	size:4;
	field:int prev_prio;	offset:32;	size:4;
	field:long prev_state;	offset:40;	size:8;
	field:char next_comm[TASK_COMM_LEN];	offset:48;	size:16;
	field:pid_t next_pid;	offset:64;	size:4;
	field:int next_prio;	offset:68;	size:4;

print fmt: "task %s:%d [%d] (%s) ==> %s:%d [%d]", REC->prev_comm, REC->prev_pid, REC->prev_prio, REC->prev_state ? __print_flags(REC->prev_state, "|", { 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" }, { 64, "x" }, { 128, "W" }) : "R", REC->next_comm, REC->next_pid, REC->next_prio

At start of the record, every event format is stored in the trace.dat
file used by trace-cmd. The trace-cmd report parses this information and
uses it to determine how to translate the records in the file.

The preferred method would be to have some way for wireshark be able to
read the trace.dat format.

Anyway, we are at the very beginning and decided to ask the
wireshark-dev list for advice. I'll start hacking various methods to see
what is best.

Thanks for any feedback.

-- Steve