GPU Profiler Quick Start

What you'll get

Within minutes of following this guide, zymtrace will profile your GPU workload and surface exactly what is slowing it down.

The flamegraph below is from a ResNet training run, profiled without any code changes. zymtrace found that about 25% of total GPU time (285s out of 1154s) was spent on NCHW to NHWC layout conversions: cuDNN was silently converting tensor memory layouts on every convolution call. The fix is a single line that converts the model and inputs to channels-last format once, up front.

[Figure: ResNet training flamegraph showing NCHW to NHWC layout conversion hotspot]

That level of insight, tied directly to the code path responsible, is what this guide gets you to.

This guide walks you through getting the zymtrace GPU profiler up and running quickly. Choose the deployment method that matches your environment.

Assumptions

This guide assumes you already have the zymtrace backend installed. If not, set that up first.

Prerequisites

  • Linux x86_64 or arm64
  • NVIDIA GPU with CUDA 12.x support

Installation

Replace ZYMTRACE_URL with your zymtrace backend address (e.g. zymtrace.company-domain.com:443).

Binary/Executable

The simplest way to get started on a single machine.

1. Download and extract

# For x86_64
curl -LO https://dl.zystem.io/zymtrace/26.4.4/amd64/zymtrace-profiler.tar.gz

# For arm64
# curl -LO https://dl.zystem.io/zymtrace/26.4.4/arm64/zymtrace-profiler.tar.gz

sudo tar -xzvf zymtrace-profiler.tar.gz -C / --no-same-owner
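
If you script the download, the architecture can be picked automatically instead of commenting lines in and out. The following is a minimal sketch, assuming the URL layout shown above stays the same for both architectures; zymtrace_arch and zymtrace_url are hypothetical helper names, not part of zymtrace.

```shell
#!/bin/sh
# Hypothetical helper: map a machine name (as printed by `uname -m`)
# to the directory name used in the download URLs above.
zymtrace_arch() {
  case "${1:-$(uname -m)}" in
    x86_64|amd64)  echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *) echo "unsupported architecture: ${1:-$(uname -m)}" >&2; return 1 ;;
  esac
}

# Hypothetical helper: build the tarball URL for the 26.4.4 release
# used in this guide. With no argument it uses the current machine.
zymtrace_url() {
  arch=$(zymtrace_arch "$@") || return 1
  echo "https://dl.zystem.io/zymtrace/26.4.4/${arch}/zymtrace-profiler.tar.gz"
}

# Usage (not executed by this snippet):
#   curl -LO "$(zymtrace_url)"
#   sudo tar -xzvf zymtrace-profiler.tar.gz -C / --no-same-owner
```

The helpers only print values, so you can check the resolved URL before downloading anything.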

2. Start the profiler

sudo /opt/zymtrace/profiler/zymtrace-profiler \
--collection-agent ZYMTRACE_URL \
--enable-gpu-metrics \
--nvml-auto-scan

Tip: If TLS is not enabled (e.g. NodePort or local setup), add --disable-tls:

sudo /opt/zymtrace/profiler/zymtrace-profiler \
--disable-tls \
--collection-agent ZYMTRACE_URL \
--enable-gpu-metrics \
--nvml-auto-scan
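
The two invocations above differ only in the --disable-tls flag. If you start the profiler from a script that targets both TLS and non-TLS backends, one way to avoid duplicating the command is to build the argument list conditionally. This is a sketch only: profiler_args is a hypothetical helper, and its second parameter is an illustrative switch, not a zymtrace setting. The flags themselves are the ones shown in this guide.

```shell
#!/bin/sh
# Hypothetical helper: print the profiler flags from this guide, one per line.
#   $1  backend address (the ZYMTRACE_URL placeholder)
#   $2  "true" to prepend --disable-tls (defaults to "false")
profiler_args() {
  url=$1
  disable_tls=${2:-false}
  if [ "$disable_tls" = "true" ]; then
    printf '%s\n' --disable-tls
  fi
  printf '%s\n' --collection-agent "$url" --enable-gpu-metrics --nvml-auto-scan
}

# Usage (not executed by this snippet; relies on word splitting,
# so the address must not contain whitespace):
#   sudo /opt/zymtrace/profiler/zymtrace-profiler $(profiler_args ZYMTRACE_URL true)
```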

3. Run your application

Basic GPU profiling (no privileged access required):

env CUDA_INJECTION64_PATH="/opt/zymtrace/profiler/libzymtracecudaprofiler.so" \
python -u your_script.py

With PC sampling (requires sudo, see PC Sampling):

sudo env RUST_LOG="zymtracecudaprofiler=info" \
CUDA_INJECTION64_PATH="/opt/zymtrace/profiler/libzymtracecudaprofiler.so" \
ZYMTRACE_CUDAPROFILER__ENABLE_PC_SAMPLING="true" \
python -u your_script.py
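
Because the profiler library is attached through an environment variable, a mistyped path is easy to miss. The sketch below wraps the basic profiling command and fails early if the library file is not readable. run_profiled and ZYMTRACE_LIB are hypothetical names introduced for illustration; only CUDA_INJECTION64_PATH and the library path come from this guide.

```shell
#!/bin/sh
# Hypothetical variable: location of the injection library.
# Defaults to the install path used in this guide.
ZYMTRACE_LIB="${ZYMTRACE_LIB:-/opt/zymtrace/profiler/libzymtracecudaprofiler.so}"

# Hypothetical wrapper: run a command with CUDA_INJECTION64_PATH set,
# but only after checking that the library file exists and is readable.
run_profiled() {
  if [ ! -r "$ZYMTRACE_LIB" ]; then
    echo "error: $ZYMTRACE_LIB not found or not readable" >&2
    return 1
  fi
  env CUDA_INJECTION64_PATH="$ZYMTRACE_LIB" "$@"
}

# Usage (not executed by this snippet):
#   run_profiled python -u your_script.py
```

The wrapper covers the unprivileged case only. For PC sampling, use the sudo env form shown above so the variables reach the privileged process.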

PC Sampling (Optional)

PC sampling provides the deepest level of GPU performance insight, including stall reasons, SASS disassembly, and memory offsets. However, NVIDIA requires privileged access (sudo or --privileged) for PC sampling due to a security vulnerability (CVE-2024-0090). We recommend enabling PC sampling in development environments and on-demand in production when deeper analysis is needed.

See the commands labeled "With PC sampling" in the installation steps above.

Next steps