torch.profiler¶

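Overview¶

PyTorch Profiler is a tool for collecting performance metrics during training and inference. The profiler's context manager API can be used to find out which model operators are the most expensive, examine their input shapes and stack traces, study device kernel activity, and visualize the execution trace.

Note

An earlier version of the API, in the torch.autograd module, is considered legacy and will be deprecated.

API Reference¶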

class torch.profiler._KinetoProfile(*, activities=None, record_shapes=False, profile_memory=False, with_stack=False, with_flops=False, with_modules=False, experimental_config=None, execution_trace_observer=None, acc_events=False)[source]¶

Low-level profiler that wraps the autograd profiler.

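Parameters

activities (iterable) – list of activity groups (CPU, CUDA) to use in profiling. Supported values: torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA, torch.profiler.ProfilerActivity.XPU. Default value: ProfilerActivity.CPU and (when available) ProfilerActivity.CUDA or (when available) ProfilerActivity.XPU.
record_shapes (bool) – save information about operators' input shapes.
profile_memory (bool) – track tensor memory allocation/deallocation (see export_memory_timeline for details).
with_stack (bool) – record source information (file and line number) for the ops.
with_flops (bool) – use a formula to estimate the FLOPS of specific operators (matrix multiplication and 2D convolution).
with_modules (bool) – record the module hierarchy (including function names) corresponding to the callstack of the op. For example, if module A's forward calls module B's forward, which contains an aten::add op, then aten::add's module hierarchy is A.B. Note that this is currently supported only for TorchScript models, not for eager mode models.
experimental_config (_ExperimentalConfig) – a set of experimental options used by profiler libraries such as Kineto. Note that backward compatibility is not guaranteed.
execution_trace_observer (ExecutionTraceObserver) – a PyTorch Execution Trace Observer object. PyTorch Execution Traces offer a graph-based representation of AI/ML workloads and enable replay benchmarks, simulators, and emulators. When this argument is included, the observer's start() and stop() are called for the same time window as the PyTorch profiler.
acc_events (bool) – enable accumulation of FunctionEvents across multiple profiling cycles.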

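Note

This API is experimental and subject to change in the future.

Enabling shape and stack tracing results in additional overhead. When record_shapes=True is specified, the profiler temporarily holds references to the tensors; this may prevent certain optimizations that depend on the reference count and may introduce extra tensor copies.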

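add_metadata(key, value)[source]¶

Adds user-defined metadata with a string key and a string value to the trace file.

add_metadata_json(key, value)[source]¶

Adds user-defined metadata with a string key and a valid JSON value to the trace file.

events()[source]¶: Returns the list of unaggregated profiler events, to be used in the trace callback or after profiling has finished.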

export_chrome_trace(path)[source]¶

Exports the collected trace in Chrome JSON format. If Kineto is enabled, only the last cycle in the schedule is exported.
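As an illustration, the following minimal sketch profiles a small CPU-only workload and exports the trace. The workload and the temporary output location are assumptions chosen for the example; they are not part of the API.

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile

# Hypothetical workload: a small matrix multiplication on the CPU.
x = torch.randn(128, 128)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    y = x @ x

# Write the trace to a temporary directory (the location is arbitrary).
trace_path = os.path.join(tempfile.mkdtemp(), "trace.json")
prof.export_chrome_trace(trace_path)

print(f"wrote {os.path.getsize(trace_path)} bytes to {trace_path}")
```

The resulting file can be loaded into a Chrome-trace-compatible viewer such as chrome://tracing.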

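export_memory_timeline(path, device=None)[source]¶

Exports memory event information from the tree collected by the profiler for a given device, and exports a timeline plot. export_memory_timeline can produce 3 kinds of file, each selected by the suffix of path.

For an HTML-compatible plot, use the suffix .html; the memory timeline plot is embedded in the HTML file as a PNG image.
For plot points consisting of [times, [sizes by category]], where times are timestamps and sizes are the memory usage for each category, the memory timeline plot is saved as JSON (.json) or gzipped JSON (.json.gz), depending on the suffix.
For raw memory points, use the suffix .raw.json.gz. Each raw memory event consists of (timestamp, action, numbytes, category), where action is one of [PREEXISTING, CREATE, INCREMENT_VERSION, DESTROY] and category is one of the enums from torch.profiler._memory_profiler.Category.

Output: the memory timeline written as gzipped JSON, JSON, or HTML.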
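The suffix rules above can be summarized with a small pure-Python sketch. The helper timeline_format is hypothetical and only models the documented mapping from suffix to output format; it is not the profiler's implementation. Rejecting other suffixes is a choice made for this sketch, since this section does not say how the profiler treats them.

```python
def timeline_format(path: str) -> str:
    """Return the output format implied by the suffix of `path` (illustrative only)."""
    # Check the most specific suffix first: ".raw.json.gz" also ends in ".json.gz".
    if path.endswith(".raw.json.gz"):
        return "raw events, gzipped JSON"
    if path.endswith(".json.gz"):
        return "plot points, gzipped JSON"
    if path.endswith(".json"):
        return "plot points, JSON"
    if path.endswith(".html"):
        return "HTML with embedded PNG plot"
    # Assumption of this sketch: anything else is treated as an error.
    raise ValueError(f"unrecognized suffix: {path}")


for name in ("mem.html", "mem.json", "mem.json.gz", "mem.raw.json.gz"):
    print(f"{name:>16} -> {timeline_format(name)}")
```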

export_stacks(path, metric='self_cpu_time_total')[source]¶

Saves stack traces to a file.

Parameters

path (str) – save the stacks file to this location.
metric (str) – metric to use: "self_cpu_time_total" or "self_cuda_time_total".

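key_averages(group_by_input_shape=False, group_by_stack_n=0)[source]¶

Averages events, grouping them by operator name and, optionally, by input shapes and stack.

Note

To use the shape/stack functionality, make sure to set record_shapes/with_stack when creating the profiler context manager.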
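A minimal sketch of grouping by input shape, assuming a CPU-only run; the tensors and their shapes are made up for the example.

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Hypothetical workload: two matrix multiplications with different shapes.
a, b = torch.randn(64, 32), torch.randn(32, 16)
c, d = torch.randn(8, 8), torch.randn(8, 8)

# record_shapes=True is needed for the shape grouping below to be meaningful.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    torch.mm(a, b)
    torch.mm(c, d)

averages = prof.key_averages(group_by_input_shape=True)
print(averages.table(sort_by="self_cpu_time_total", row_limit=10))

# Each aggregated entry exposes the operator name (key) and its input shapes.
mm_shapes = [evt.input_shapes for evt in averages if evt.key == "aten::mm"]
print(mm_shapes)
```

With shape grouping enabled, calls to the same operator with different input shapes are reported as separate rows rather than being merged.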

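preset_metadata_json(key, value)[source]¶

Presets user-defined metadata while the profiler is not yet started; the metadata is added to the trace file later. The metadata takes the form of a string key and a valid JSON value.

toggle_collection_dynamic(enable, activities)[source]¶

Toggles the collection of activities on or off at any point during collection. Currently supports toggling Torch Ops (CPU) and the CUDA activity supported in Kineto.

Parameters: activities (iterable) – list of activity groups to use in profiling. Supported values: torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA

Examples: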

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    code_to_profile_0()
    # turn off collection of all CUDA activity
    p.toggle_collection_dynamic(False, [torch.profiler.ProfilerActivity.CUDA])
    code_to_profile_1()
    # turn on collection of all CUDA activity
    p.toggle_collection_dynamic(True, [torch.profiler.ProfilerActivity.CUDA])
    code_to_profile_2()
print(p.key_averages().table(
    sort_by="self_cuda_time_total", row_limit=-1))

class torch.profiler.profile(*, activities=None, schedule=None, on_trace_ready=None, record_shapes=False, profile_memory=False, with_stack=False, with_flops=False, with_modules=False, experimental_config=None, execution_trace_observer=None, acc_events=False, use_cuda=None)[source]¶

Profiler context manager.

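Parameters

activities (iterable) – list of activity groups (CPU, CUDA) to use in profiling. Supported values: torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA, torch.profiler.ProfilerActivity.XPU. Default value: ProfilerActivity.CPU and (when available) ProfilerActivity.CUDA or (when available) ProfilerActivity.XPU.
schedule (Callable) – callable that takes the step (int) as a single parameter and returns the ProfilerAction value that specifies the profiler action to perform at each step.
on_trace_ready (Callable) – callable that is called at each step when schedule returns ProfilerAction.RECORD_AND_SAVE during profiling.
record_shapes (bool) – save information about operators' input shapes.
profile_memory (bool) – track tensor memory allocation/deallocation.
with_stack (bool) – record source information (file and line number) for the ops.
with_flops (bool) – use a formula to estimate the FLOPs (floating point operations) of specific operators (matrix multiplication and 2D convolution).
with_modules (bool) – record the module hierarchy (including function names) corresponding to the callstack of the op. For example, if module A's forward calls module B's forward, which contains an aten::add op, then aten::add's module hierarchy is A.B. Note that this is currently supported only for TorchScript models, not for eager mode models.
experimental_config (_ExperimentalConfig) – a set of experimental options used for Kineto library features. Note that backward compatibility is not guaranteed.
execution_trace_observer (ExecutionTraceObserver) – a PyTorch Execution Trace Observer object. PyTorch Execution Traces offer a graph-based representation of AI/ML workloads and enable replay benchmarks, simulators, and emulators. When this argument is included, the observer's start() and stop() are called for the same time window as the PyTorch profiler. See the examples section below for a code sample.
acc_events (bool) – enable accumulation of FunctionEvents across multiple profiling cycles.
use_cuda (bool) –

Deprecated since version 1.8.1: use activities instead.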

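Note

Use schedule() to generate the callable schedule. Non-default schedules are useful when profiling long training jobs and allow the user to obtain multiple traces at different iterations of the training process. The default schedule simply records all events continuously for the duration of the context manager.

Note

Use tensorboard_trace_handler() to generate result files for TensorBoard: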

on_trace_ready=torch.profiler.tensorboard_trace_handler(dir_name)

After profiling, the result files can be found in the specified directory. Use the command:

tensorboard --logdir dir_name

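to see the results in TensorBoard. For more information, see PyTorch Profiler TensorBoard Plugin.

Note

Enabling shape and stack tracing results in additional overhead. When record_shapes=True is specified, the profiler temporarily holds references to the tensors; this may prevent certain optimizations that depend on the reference count and may introduce extra tensor copies.

Examples: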

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    code_to_profile()
print(p.key_averages().table(
    sort_by="self_cuda_time_total", row_limit=-1))

Using the profiler's schedule, on_trace_ready and step functions:

# Non-default profiler schedule allows user to turn profiler on and off
# on different iterations of the training loop;
# trace_handler is called every time a new trace becomes available
def trace_handler(prof):
    print(prof.key_averages().table(
        sort_by="self_cuda_time_total", row_limit=-1))
    # prof.export_chrome_trace("/tmp/test_trace_" + str(prof.step_num) + ".json")

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],

    # In this example with wait=1, warmup=1, active=2, repeat=1,
    # profiler will skip the first step/iteration,
    # start warming up on the second, record
    # the third and the fourth iterations,
    # after which the trace will become available
    # and on_trace_ready (when set) is called;
    # the cycle repeats starting with the next step

    schedule=torch.profiler.schedule(
        wait=1,
        warmup=1,
        active=2,
        repeat=1),
    on_trace_ready=trace_handler
    # on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')
    # used when outputting for tensorboard
    ) as p:
        for iter in range(N):
            code_iteration_to_profile(iter)
            # send a signal to the profiler that the next iteration has started
            p.step()

The following sample shows how to set up an Execution Trace Observer (execution_trace_observer):

with torch.profiler.profile(
    ...
    execution_trace_observer=(
        ExecutionTraceObserver().register_callback("./execution_trace.json")
    ),
) as p:
    for iter in range(N):
        code_iteration_to_profile(iter)
        p.step()

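You can also refer to test_execution_trace_with_kineto() in tests/profiler/test_profiler.py. Note: any object satisfying the _ITraceObserver interface can also be passed.

step()[source]¶: Signals the profiler that the next profiling step has started.

class torch.profiler.ProfilerAction(value)[source]¶: Profiler actions that can be taken at the specified intervals.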

class torch.profiler.ProfilerActivity¶

Members:

CPU

XPU

MTIA

CUDA

PrivateUse1

property name¶

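torch.profiler.schedule(*, wait, warmup, active, repeat=0, skip_first=0)[source]¶

Returns a callable that can be used as the profiler's schedule argument. The profiler skips the first skip_first steps, then waits for wait steps, then warms up for the next warmup steps, then actively records for the next active steps, and then repeats the cycle starting with the wait steps. The optional number of cycles is specified with the repeat parameter; a value of zero means that the cycles continue until profiling is finished.

Return type: Callable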
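To make the cycle concrete, here is a pure-Python model of the behavior described above. It is a sketch, not the library's code: sketch_schedule is a hypothetical name, and actions are represented as plain strings that mirror ProfilerAction member names. It also assumes that the last active step of each cycle is the one that yields RECORD_AND_SAVE, which is the action that triggers on_trace_ready.

```python
def sketch_schedule(*, wait, warmup, active, repeat=0, skip_first=0):
    """Pure-Python model of the documented schedule cycle (not the library code)."""

    def action_for(step: int) -> str:
        if step < skip_first:
            return "NONE"
        step -= skip_first
        cycle_len = wait + warmup + active
        # repeat=0 means the cycles continue until profiling is finished.
        if repeat > 0 and step >= repeat * cycle_len:
            return "NONE"
        pos = step % cycle_len
        if pos < wait:
            return "NONE"
        if pos < wait + warmup:
            return "WARMUP"
        # Assumption: the last active step of a cycle also saves the trace.
        return "RECORD_AND_SAVE" if pos == cycle_len - 1 else "RECORD"

    return action_for


plan = sketch_schedule(wait=1, warmup=1, active=2, repeat=1)
print([plan(step) for step in range(6)])
# -> ['NONE', 'WARMUP', 'RECORD', 'RECORD_AND_SAVE', 'NONE', 'NONE']
```

With repeat=1 the model returns NONE for every step after the first cycle; with repeat=0 the same four-step pattern would recur indefinitely.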

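torch.profiler.tensorboard_trace_handler(dir_name, worker_name=None, use_gzip=False)[source]¶

Outputs tracing files to the directory dir_name; that directory can then be passed directly to TensorBoard as the logdir. worker_name should be unique for each worker in a distributed scenario; by default it is set to '[hostname]_[pid]'.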

Intel Instrumentation and Tracing Technology API¶

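torch.profiler.itt.is_available()[source]¶: Checks whether the ITT feature is available.

torch.profiler.itt.mark(msg)[source]¶

Describes an instantaneous event that occurred at some point in time.

Parameters: msg (str) – ASCII message to associate with the event.

torch.profiler.itt.range_push(msg)[source]¶

Pushes a range onto a stack of nested range spans. Returns the zero-based depth of the range that is started.

Parameters: msg (str) – ASCII message to associate with the range.

torch.profiler.itt.range_pop()[source]¶: Pops a range off a stack of nested range spans. Returns the zero-based depth of the range that is ended.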
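The depth values returned by range_push and range_pop can be illustrated with a toy model. RangeStack below is a hypothetical class written for this illustration; it only mirrors the documented return values and does not call into ITT.

```python
class RangeStack:
    """Toy model of nested ranges; it does not call into ITT."""

    def __init__(self):
        self._names = []

    def push(self, msg: str) -> int:
        self._names.append(msg)
        # Zero-based depth of the range that was just started.
        return len(self._names) - 1

    def pop(self) -> int:
        self._names.pop()
        # Zero-based depth of the range that was just ended.
        return len(self._names)


ranges = RangeStack()
print(ranges.push("epoch"))    # 0
print(ranges.push("forward"))  # 1
print(ranges.pop())            # 1
print(ranges.pop())            # 0
```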
