Rust Profiling Essentials with perf

A: Sampling the program at specific time, and do some statistics analysis.

It can be one of the following:

  • Reading Backtraces of the program for every 1000th cycles, and represent in flamegraph.
  • Reading Backtraces of the program for every 1000th cache miss, and represent in flamegraph.
  • Reading Backtraces of the program for every 10th memory allocation, and represent in flamegraph.
  • Get return address of the program for every 10th memory allocation, and show counts for every line.

Kind of triggers:

  • Hardware Interrupts:
    • Performance Monitoring Unit (PMU): A hardware unit that can be programmed to count hardware events.
    • Time-based interrupts: Timer interrupt, which can be programmed to interrupt the CPU at a specific interval.
  • Software Interrupts:
    • Profiling API: A software API that can be used to trigger a sample. kprobes, uprobes, perf_event_open, etc.

A hardware unit that can be programmed to count hardware events. This is the most accurate way to gather data, but it is also the most complex way. It might differ from CPU to CPU, and it might require root access to use.

  • Pros:
    • Can be used as a counter to track various events, like:
      • cache misses
      • CPU cycles
      • L1-dcache-misses
  • Cons:
    • Requires root access.
    • Configuration may vary between different CPUs (platform-dependent).

The counter is a collection of Model-Specific Registers (MSR) , which are read/write through RDMSR/WRMSR like other MSRs. They count fixed or configurable events.

  • Fixed: Counts for predefined events such as core clock and retired instructions.
  • Configurable: Can be set up through another MSR to count a specific event.

Here’s a sample for (default event) CPU cycles at 999Hz

1
2
3
$ perf record -F 999
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 2.797 MB perf.data (5131 samples) ]

Here’s another Sample for L1-dcache-misses at 99Hz

1
2
3
$ perf record -F 99 -e L1-dcache-misses
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 2.549 MB perf.data (606 samples) ]

But how do we make sure our sample is accurate, distributed evenly, and not biased? We know sometime CPU might be busy(or sleep, etc.), and the sample might be missed. How do we make sure the sample is accurate?

The answer is - Linux. Linux is smart enought to automatically adjust the sample rate to make sure the sample is accurate.

For example, If we want to sample at 100Hz(10ms), but in the previous 1ms, the delta_count is 5, then the period should be 50. This value is tuned for every 1ms (by default, HZ=1000)

Exact formula: $$ period = \frac{delta\_count \times 10^{9}}{delta\_nsec \times sample\_freq} $$

Though PMU is great, some profilers use the event based on hrtimers (especially setitimer syscall).

1
2
setitimer(ITIMER_PROF, &interval, &old_interval);
  // -> Send SIGPROF everytime the timer hits

Profilers utilize time-based interrupts to gather data, such as gperftools , go perf , pprof-rs , and others.

  • Pros:
    • Easy to use and cross-platform (as setitimer , a POSIX function)
    • Real-time stack unwinding capability reduces memory and disk traffic overhead.
  • Cons:
    • Less accuracy compared to PMU-based profilers.
    • Incurs greater overhead than PMU-based profilers, making it better suited for high-level profiling. Localized profiling might result in biases and may be less effective.

The differnce between HW/SW, is who triggers the sample.

The kernel rewrites certain instructions, such as the entry of the function, to INT 3. When reading related pages, these instructions are automatically replaced. Upon encountering this interrupt, the system catches it and increments a counter. Once this counter meets or exceeds a predefined period, a sample is taken. Subsequently, the counter is reset, allowing the program to continue its execution.

1
2
3
4
5
6
7
8
9
$ sudo perf probe -x /lib/libc.so.6 --add malloc
Added new events:
probe_libc:malloc  (on malloc in /usr/lib/libc.so.6)
probe_libc:malloc  (on malloc in /usr/lib/libc.so.6)

# We can now use it in all perf tools, such as:
$ sudo perf record -e probe_libc:malloc -F 999
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 11.520 MB perf.data ]

After discussing interrupts, we now turn to how to gather the data. For every interrupt or signal handler, we have access to:

  • The instruction register.
  • All general-purpose registers.
  • System information of the current program, including process ID (PID), memory, etc.

Obtaining the instruction address is relatively straightforward. However, the challenge lies in acquiring the backtrace. We can use something like Frame Pointer or DWARF or LBR.

This method provides a straightforward approach to obtaining the backtrace, although it is not commonly used today. For more details, please refer to this resource .

Rely on compiler to generate the debug information, and use it to unwind the stack. Take this piece of code as an example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
fn add1(x: i32) -> i32 {
    x + 1  // 0x0a
}

fn add2(x: i32) -> i32 {
    add1(x) + 2  // 0x1f call
}

fn main() {
    return add2(1);  // 0x3a call
}

This a table similiar to DWARF format that specifies how to set registers to restore the previous frame:

Loc CFA RA EDI
0x0a esp+4 *cfa *(cfa-4)
0x1f esp+4 *cfa *(cfa-4)

But DWARF use a more compact format to save the space, here’s a sample:

1
2
3
4
5
6
7
0x00000250:   DW_CFA_advance_loc: 1 to 0x00000251
0x00000251:   DW_CFA_def_cfa_offset: 16
0x00000254:   DW_CFA_offset: r6, -16
0x00000257:   DW_CFA_advance_loc: 3 to 0x0000025a
0x0000025a:   DW_CFA_def_cfa_register: r6
0x0000025d:   DW_CFA_advance_loc: 5 to 0x0000025f
0x0000025f:   DW_CFA_offset: r14, -8

The DWARF format offers enhanced compatibility across various compilers and architectures. However, this comes with trade-offs, such as occasional compiler bugs and the limitation that it can only operate within user space to avoid potential kernel crashes. To unwind the stack using DWARF, perf replicates the entire stack for each sample collected. This approach is not ideal for high-frequency sampling, as it can lead to significant overhead.

Frame Pointer DWARF
Pros Fast and lightweight Provides detailed and accurate stack traces if well-supported by the compiler
Simple, widely supported Compatible with various compilers and architectures
Effective in both user and kernel space Better for complex debugging and optimized code
Cons Limited accuracy with optimized code Higher overhead due to detailed information processing
Less detailed stack information Primarily for user space; risky for kernel space
Slower and more resource-intensive

A flamegraph is a visualization tool that represents the data we generated using tools mentioned above. It is widely used in profiling to display the stack traces of a program. The x-axis represents the stack depth, while the y-axis displays the function name. The width of each box corresponds to the number of samples taken for that function. The flamegraph is an effective way to visualize the performance of a program and identify bottlenecks. In Rust ecosystem, there’re some existed tools that can help to generate flamegraph, such as:

In conclusion, we have discussed the essentials of profiling, including the various methods to trigger samples and gather data. We have also explored the different ways to obtain backtraces, such as using frame pointers or DWARF. Each method has its own set of pros and cons, and the choice of method depends on the specific requirements of the profiling task at hand. By understanding these concepts, we can better optimize our profiling workflow and gain valuable insights into our program’s performance.

This article is inspired by YangKeao presentation

Related Content