Profiling your PyTorch Module

Created On: Dec 30, 2020 | Last Updated: Jan 19, 2024 | Last Verified: Nov 05, 2024

Author: Suraj Subramanian

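PyTorch includes a profiler API that is useful for identifying the time and memory costs of the various PyTorch operations in your code. Profiler can be easily integrated into your code, and the results can be printed as a table or returned in a JSON trace file.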
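As a rough illustration of the two output forms mentioned above, the following sketch profiles a single CPU matrix multiplication, prints the aggregated table, and writes a JSON trace with export_chrome_trace. The tensor size and the temporary trace path are assumptions made for this example only.

```python
import os
import tempfile

import torch
import torch.autograd.profiler as profiler

x = torch.rand(64, 64)

# Profile one CPU matrix multiplication.
with profiler.profile() as prof:
    y = x @ x

# Output form 1: an aggregated table, returned as a string.
table = prof.key_averages().table(row_limit=5)
print(table)

# Output form 2: a JSON trace file (the path is chosen only for this sketch).
trace_path = os.path.join(tempfile.mkdtemp(), "trace.json")
prof.export_chrome_trace(trace_path)
print("trace written to", trace_path)
```

The trace file can then be opened in a trace viewer such as the one built into Chromium-based browsers.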

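Note

Profiler supports multithreaded models. Profiler runs in the same thread as the operation, but it will also profile child operators that might run in another thread. Concurrently running profilers are scoped to their own thread to prevent their results from mixing.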

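Note

PyTorch 1.8 introduces a new API, torch.profiler, that will replace the older profiler API in future releases. See the torch.profiler documentation for the new API.

Head over to the profiler recipe for a quicker walkthrough of Profiler API usage.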
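For orientation, here is a minimal sketch of what profiling looks like with the newer torch.profiler API. It runs on the CPU only, and the layer sizes and the "LINEAR PASS" label are chosen to mirror the rest of this tutorial; treat it as an illustration under those assumptions rather than a drop-in replacement for the code below.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

x = torch.rand(128, 500)
linear = torch.nn.Linear(500, 10)

# Collect CPU activity only, so the sketch does not require a GPU.
with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    with record_function("LINEAR PASS"):
        out = linear(x)

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```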

import torch
import numpy as np
from torch import nn
import torch.autograd.profiler as profiler

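Performance debugging using Profiler

Profiler can be useful for identifying performance bottlenecks in your models. In this example, we build a custom module that performs two sub-tasks:

a linear transformation on the input, and
use of the transformation result to get indices on a mask tensor.

We wrap the code for each sub-task in a separately labelled context manager using profiler.record_function("label"). In the profiler output, the aggregate performance metrics of all operations in the sub-task show up under its corresponding label.

Note that using Profiler incurs some overhead and is best used only for investigating code. Remember to remove it if you are benchmarking runtimes.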
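To make the idea of a labelled region concrete, the sketch below wraps two CPU operations in one record_function block and compares the label's total time (which includes the operations inside it) with its self time (which excludes them). The label name "MATMUL CHAIN" and the tensor size are hypothetical, chosen only for this illustration.

```python
import torch
import torch.autograd.profiler as profiler

x = torch.rand(256, 256)

with profiler.profile() as prof:
    # Everything inside this block is attributed to the label.
    with profiler.record_function("MATMUL CHAIN"):
        y = x @ x
        z = torch.relu(y)

# Index the aggregated events by name to look up the label's row.
averages = {evt.key: evt for evt in prof.key_averages()}
label = averages["MATMUL CHAIN"]
print(f"total: {label.cpu_time_total:.1f}us, self: {label.self_cpu_time_total:.1f}us")
```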

class MyModule(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super(MyModule, self).__init__()
        self.linear = nn.Linear(in_features, out_features, bias)

    def forward(self, input, mask):
        with profiler.record_function("LINEAR PASS"):
            out = self.linear(input)

        with profiler.record_function("MASK INDICES"):
            threshold = out.sum(axis=1).mean().item()
            hi_idx = np.argwhere(mask.cpu().numpy() > threshold)
            hi_idx = torch.from_numpy(hi_idx).cuda()

        return out, hi_idx

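Profile the forward pass

We initialize random input and mask tensors, and the model.

Before we run the profiler, we warm up CUDA to ensure accurate performance benchmarking. We wrap the forward pass of our module in the profiler.profile context manager. The with_stack=True parameter appends the file and line number of each operation to the trace.

Warning

with_stack=True incurs additional overhead and is better suited to investigating code. Remember to remove it if you are benchmarking performance.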

model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.double).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

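Print the profiler results

Finally, we print the profiler results. profiler.key_averages aggregates the results by operator name, and optionally by input shapes and/or stack trace events. Grouping by input shapes is useful for identifying which tensor shapes the model uses.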
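Grouping by input shape can be sketched as follows. This CPU-only example assumes shapes are recorded by passing record_shapes=True to the profiler, and it feeds the same layer two different batch sizes (128 and 32, both arbitrary) so that the grouped table has separate rows per shape.

```python
import torch
import torch.autograd.profiler as profiler

linear = torch.nn.Linear(500, 10)

# record_shapes=True stores the input shapes of each operator call.
with profiler.profile(record_shapes=True) as prof:
    linear(torch.rand(128, 500))
    linear(torch.rand(32, 500))

ungrouped = prof.key_averages()
grouped = prof.key_averages(group_by_input_shape=True)

# Calls with different input shapes are no longer merged into one row.
print(grouped.table(row_limit=10))
print(len(ungrouped), "rows by name;", len(grouped), "rows by name and shape")
```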

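Here, we use group_by_stack_n=5, which aggregates runtimes by the operation and its traceback (truncated to the most recent 5 events), and displays the events in the order they are registered. The table can also be sorted by passing a sort_by argument (refer to the docs for valid sorting keys).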
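As one example of an alternative sort order, the sketch below sorts by self CPU memory instead of time. It assumes memory tracking is enabled with profile_memory=True and uses the sort key "self_cpu_memory_usage"; the tensor size is arbitrary, and the docs remain the authority on which keys are valid in your PyTorch version.

```python
import torch
import torch.autograd.profiler as profiler

# profile_memory=True is needed for the memory columns to be populated.
with profiler.profile(profile_memory=True) as prof:
    a = torch.rand(1000, 1000)
    b = a + a

# Sort rows by the memory allocated by each operator itself.
memory_table = prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=5)
print(memory_table)
```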

Note

When running profiler in a notebook, you might see entries like <ipython-input-18-193a910735e8>(13): forward instead of filenames in the stack trace. These correspond to <notebook-cell>(line number): calling-function.

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

-------------  ------------  ------------  ------------  ---------------------------------
         Name    Self CPU %      Self CPU  Self CPU Mem   Source Location
-------------  ------------  ------------  ------------  ---------------------------------
 MASK INDICES        87.88%        5.212s    -953.67 Mb  /mnt/xarfuse/.../torch/au
                                                         <ipython-input-...>(10): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/

  aten::copy_        12.07%     715.848ms           0 b  <ipython-input-...>(12): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/
                                                         /mnt/xarfuse/.../IPython/

  LINEAR PASS         0.01%     350.151us         -20 b  /mnt/xarfuse/.../torch/au
                                                         <ipython-input-...>(7): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/

  aten::addmm         0.00%     293.342us           0 b  /mnt/xarfuse/.../torch/nn
                                                         /mnt/xarfuse/.../torch/nn
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(8): forward
                                                         /mnt/xarfuse/.../torch/nn

   aten::mean         0.00%     235.095us           0 b  <ipython-input-...>(11): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/
                                                         /mnt/xarfuse/.../IPython/

-----------------------------  ------------  ---------- ----------------------------------
Self CPU time total: 5.931s

"""

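Improve memory performance

Note that the most expensive operations, in terms of both memory and time, are at forward (10), which represents the operations within MASK INDICES. Let's tackle the memory consumption first. We can see that the .to() operation at line 12 consumes 953.67 Mb. This operation copies mask to the CPU. mask is initialized with a torch.double datatype. Can we reduce the memory footprint by casting it to torch.float instead?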
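The reported figure can be checked with simple arithmetic. The sketch below assumes the profiler's "Mb" means 2**20 bytes (which is what makes the numbers line up) and uses the standard element sizes of 8 bytes for torch.double and 4 bytes for torch.float; the helper name footprint_mib is made up for this illustration.

```python
# Element sizes in bytes: torch.double is 64-bit, torch.float is 32-bit.
BYTES_PER_ELEMENT = {"torch.double": 8, "torch.float": 4}

num_elements = 500 * 500 * 500  # shape of the mask tensor


def footprint_mib(dtype_name: str) -> float:
    """Size of the mask in units of 2**20 bytes for the given dtype."""
    return num_elements * BYTES_PER_ELEMENT[dtype_name] / 2**20


double_mib = footprint_mib("torch.double")
float_mib = footprint_mib("torch.float")
print(f"{double_mib:.2f} vs {float_mib:.2f}")  # prints "953.67 vs 476.84"
```

So a CPU copy of the double-precision mask accounts for the 953.67 Mb seen in the table, and switching to single precision should halve it.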

model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.float).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

-----------------  ------------  ------------  ------------  --------------------------------
             Name    Self CPU %      Self CPU  Self CPU Mem   Source Location
-----------------  ------------  ------------  ------------  --------------------------------
     MASK INDICES        93.61%        5.006s    -476.84 Mb  /mnt/xarfuse/.../torch/au
                                                             <ipython-input-...>(10): forward
                                                             /mnt/xarfuse/  /torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/

      aten::copy_         6.34%     338.759ms           0 b  <ipython-input-...>(12): forward
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

 aten::as_strided         0.01%     281.808us           0 b  <ipython-input-...>(11): forward
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

      aten::addmm         0.01%     275.721us           0 b  /mnt/xarfuse/.../torch/nn
                                                             /mnt/xarfuse/.../torch/nn
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(8): forward
                                                             /mnt/xarfuse/.../torch/nn

      aten::_local        0.01%     268.650us           0 b  <ipython-input-...>(11): forward
      _scalar_dense                                          /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

-----------------  ------------  ------------  ------------  --------------------------------
Self CPU time total: 5.347s

"""

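The CPU memory footprint for this operation has halved.

Improve time performance

While the time consumed has also dropped a little, it is still too high. It turns out that copying a matrix from CUDA to CPU is pretty expensive! The aten::copy_ operator in forward (12) copies mask to the CPU so that it can use the NumPy argwhere function. aten::copy_ at forward(13) copies the array back to CUDA as a tensor. We could eliminate both of these copies if we use the torch function nonzero() here instead.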
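Before swapping the functions, it helps to confirm that they return the same indices. The small CPU-only sketch below compares np.argwhere with torch's nonzero() on a hypothetical 2x2 mask and a fixed threshold of 0.5. Note that nonzero(as_tuple=True) returns one index tensor per dimension rather than a single N x d tensor, so downstream code that consumes the indices may need adjusting.

```python
import numpy as np
import torch

mask = torch.tensor([[0.1, 0.9], [0.7, 0.2]])
threshold = 0.5

# Original approach: round trip through NumPy.
via_numpy = torch.from_numpy(np.argwhere(mask.numpy() > threshold))

# Replacement: stay in torch; returns an N x d tensor of indices.
via_torch = (mask > threshold).nonzero()

# as_tuple=True returns a tuple of 1-D tensors, one per dimension.
as_tuple = (mask > threshold).nonzero(as_tuple=True)

print(via_torch)
print(as_tuple)
```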

class MyModule(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super(MyModule, self).__init__()
        self.linear = nn.Linear(in_features, out_features, bias)

    def forward(self, input, mask):
        with profiler.record_function("LINEAR PASS"):
            out = self.linear(input)

        with profiler.record_function("MASK INDICES"):
            threshold = out.sum(axis=1).mean()
            hi_idx = (mask > threshold).nonzero(as_tuple=True)

        return out, hi_idx


model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.float).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

--------------  ------------  ------------  ------------  ---------------------------------
          Name    Self CPU %      Self CPU  Self CPU Mem   Source Location
--------------  ------------  ------------  ------------  ---------------------------------
      aten::gt        57.17%     129.089ms           0 b  <ipython-input-...>(12): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/
                                                          /mnt/xarfuse/.../IPython/

 aten::nonzero        37.38%      84.402ms           0 b  <ipython-input-...>(12): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/
                                                          /mnt/xarfuse/.../IPython/

   INDEX SCORE         3.32%       7.491ms    -119.21 Mb  /mnt/xarfuse/.../torch/au
                                                          <ipython-input-...>(10): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/

aten::as_strided         0.20%    441.587us          0 b  <ipython-input-...>(12): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/
                                                          /mnt/xarfuse/.../IPython/

 aten::nonzero
     _numpy             0.18%     395.602us           0 b  <ipython-input-...>(12): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/
                                                          /mnt/xarfuse/.../IPython/
--------------  ------------  ------------  ------------  ---------------------------------
Self CPU time total: 225.801ms

"""

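Further Reading

We have seen how Profiler can be used to investigate time and memory bottlenecks in PyTorch models. To read more about Profiler, see the PyTorch profiler recipe and the profiler API documentation.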
