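TorchDynamo Troubleshooting¶

TorchDynamo is still in active development, and many of the causes of graph breaks and excessive recompilation will be addressed by upcoming support for tracing dynamic tensor shapes, more careful choices of guards, and better-tuned heuristics.

In the meantime, you may need to diagnose a particular issue and determine whether it is easy to work around with a change to your model, or file an issue for support.

We are also actively developing debugging tools and profilers, and improving our errors and warnings. Please give us feedback if you have an issue with this infrastructure or an idea for an improvement. The table below lists the available tools and their typical usage. For additional help, see Diagnosing Runtime Errors.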

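Available debugging tools¶
Tool	Purpose	Usage
Info logging	View summarized steps of compilation	`torch._dynamo.config.log_level = logging.INFO`
Debug logging	View detailed steps of compilation (prints every instruction traced)	`torch._dynamo.config.log_level = logging.DEBUG` and `torch._dynamo.config.verbose = True`
Minifier for any backend	Find the smallest subgraph that reproduces an error for any backend	Set the environment variable `TORCHDYNAMO_REPRO_AFTER="dynamo"`
Minifier for `TorchInductor`	If the error is known to occur after AOTAutograd, find the smallest subgraph that reproduces the error during TorchInductor lowering	Set the environment variable `TORCHDYNAMO_REPRO_AFTER="aot"`
Dynamo accuracy minifier	Find the smallest subgraph that reproduces an accuracy issue between the eager-mode model and the optimized model, when you suspect the problem is in AOTAutograd	`TORCHDYNAMO_REPRO_AFTER="dynamo" TORCHDYNAMO_REPRO_LEVEL=4`
Inductor accuracy minifier	Find the smallest subgraph that reproduces an accuracy issue between the eager-mode model and the optimized model, when you suspect the problem is in the backend (for example, Inductor). If this does not work, try the Dynamo accuracy minifier instead.	`TORCHDYNAMO_REPRO_AFTER="aot" TORCHDYNAMO_REPRO_LEVEL=4`
`torch._dynamo.explain`	Find graph breaks and display the reasons for them	`torch._dynamo.explain(fn, *inputs)`
Record/Replay	Record and replay frames in order to reproduce errors during graph capture	`torch._dynamo.config.replay_record_enabled = True`
TorchDynamo function name filtering	Compile only functions with the given name, to reduce noise when debugging an issue	Set the environment variable `TORCHDYNAMO_DEBUG_FUNCTION=<name>`
TorchInductor debug logging	Print general TorchInductor debug info and the generated Triton/C++ code	`torch._inductor.config.debug = True`
TorchInductor tracing	Show the time taken in each TorchInductor stage, plus output code and graph visualization	Set the environment variable `TORCH_COMPILE_DEBUG=1` or `torch._inductor.config.trace.enabled = True`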
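
The environment variables in the table have to be set before the program being debugged starts. As a minimal sketch, the snippet below builds such an environment for a child process. `minifier_env` is a hypothetical helper written for this illustration and is not part of PyTorch; only the variable names and values are taken from the table.

```python
import os


def minifier_env(after, accuracy=False, base_env=None):
    """Return a copy of the environment with the minifier variables set.

    `after` is "dynamo" or "aot" (the two values listed in the table).
    `accuracy=True` adds TORCHDYNAMO_REPRO_LEVEL=4 for accuracy minification.
    Hypothetical helper, for illustration only.
    """
    if after not in ("dynamo", "aot"):
        raise ValueError(f"unsupported value: {after!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["TORCHDYNAMO_REPRO_AFTER"] = after
    if accuracy:
        env["TORCHDYNAMO_REPRO_LEVEL"] = "4"
    return env


# Start from an empty base environment so the printed result is easy to read.
env = minifier_env("aot", accuracy=True, base_env={})
print(env)
```

The resulting dictionary can be passed as `env=` to `subprocess.run`, together with the command that launches the script under investigation.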

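Diagnosing Runtime Errors¶

Below is the TorchDynamo compiler stack.

At a high level, the TorchDynamo stack consists of graph capture from Python code (TorchDynamo) and a backend compiler. In this example, the backend compiler consists of backward graph tracing (AOTAutograd) and graph lowering (TorchInductor)*. Errors can occur in any component of the stack, and a full stack trace is provided.

You can use info logging (`torch._dynamo.config.log_level = logging.INFO`) and look for `Step #: ...` output to determine in which component the error occurred. A log entry is made at the beginning and at the end of each step, so the step an error corresponds to is the most recently logged step whose end has not yet been logged. The steps correspond to the following parts of the stack (according to the image above):

Step	Component
1	TorchDynamo
2	Compiler Backend
3	TorchInductor

The beginning and end of AOTAutograd are currently not logged, but we plan to add this soon.

If info logging is insufficient, there are also backend options that help you determine which component is causing the error when you cannot make sense of the error message that is generated. These are the following:

"eager": runs only the TorchDynamo forward graph capture and then runs the captured graph with PyTorch. This indicates whether TorchDynamo itself is raising the error.
"aot_eager": runs TorchDynamo to capture the forward graph, and then AOTAutograd to trace the backward graph, without any additional backend compiler steps. PyTorch eager is then used to run the forward and backward graphs. This is useful for narrowing the issue down to AOTAutograd.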

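The general procedure to narrow down an issue is the following:

1. Run your program with the "eager" backend. If the error no longer occurs, the issue is in the backend compiler being used (if you are using TorchInductor, proceed to step 2; if not, see the section on minifying backend compiler errors). If the error still occurs with the "eager" backend, it is an error that occurs while running TorchDynamo.
2. This step is only necessary if TorchInductor is used as the backend compiler. Run the model with the "aot_eager" backend. If this backend raises an error, the error occurs during AOTAutograd tracing. If the error no longer occurs with this backend, the error is in TorchInductor*.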
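
The two-step procedure above can be sketched as a small decision function. This is an illustration under stated assumptions rather than a PyTorch utility: `locate_failure` and `make_runner` are hypothetical names, and the caller supplies a `run_with_backend(name)` callable that runs the program with the given backend and raises if the error reproduces.

```python
def locate_failure(run_with_backend):
    """Apply the two-step procedure and name the component most likely at fault.

    `run_with_backend(name)` is a caller-supplied callable (an assumption of
    this sketch) that runs the program with the given backend string and
    raises if the error reproduces.
    """
    try:
        run_with_backend("eager")
    except Exception:
        # Step 1: the error survives with "eager", so it arises in TorchDynamo.
        return "TorchDynamo"
    try:
        run_with_backend("aot_eager")
    except Exception:
        # Step 2: "eager" passes but "aot_eager" fails: AOTAutograd tracing.
        return "AOTAutograd"
    # Both pass, so what remains is the backend compiler (e.g. TorchInductor lowering).
    return "backend compiler"


def make_runner(failing):
    """Build a stand-in runner that raises for the backends named in `failing`."""
    def run(name):
        if name in failing:
            raise RuntimeError(f"simulated failure with backend {name!r}")
    return run


# Simulate a program that only fails once lowering is involved.
print(locate_failure(make_runner({"inductor"})))  # backend compiler
```

With PyTorch installed, the runner could be something like `lambda name: torch.compile(fn, backend=name)(*inputs)`, where `fn` and `inputs` stand for the function and arguments under test.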

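Each of these cases is analyzed in the following sections.

Note

The TorchInductor backend consists of both AOTAutograd tracing and the TorchInductor compiler itself. We disambiguate by referring to TorchInductor as the backend, and to TorchInductor lowering as the phase that lowers the graph traced by AOTAutograd.

TorchDynamo Errors¶

If the error occurs with the "eager" backend, TorchDynamo is the most likely source of the error. Here is sample code that generates an error.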

import torch

import torch._dynamo as dynamo


def test_assertion_error():
    y = torch.ones(200, 200)
    z = {y: 5}
    return z

compiled_test_assertion_error = torch.compile(test_assertion_error, backend="eager")

compiled_test_assertion_error()

This generates the following error:

torch._dynamo.convert_frame: [ERROR] WON'T CONVERT test_assertion_error /scratch/mlazos/torchdynamo/../test/errors.py line 26
due to:
Traceback (most recent call last):
  File "/scratch/mlazos/torchdynamo/torchdynamo/symbolic_convert.py", line 837, in BUILD_MAP
    assert isinstance(k, ConstantVariable) or (
AssertionError

from user code:
   File "/scratch/mlazos/torchdynamo/../test/errors.py", line 34, in test_assertion_error
    z = {y: 5}

Set torch._dynamo.config.verbose=True for more information
==========

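As the message suggests, you can set `torch._dynamo.config.verbose=True` to get a full stack trace for both the error in TorchDynamo and the user code. In addition to this flag, you can set TorchDynamo's `log_level` through `torch._dynamo.config.log_level`. The available levels are the following:

logging.DEBUG: prints every instruction that is encountered, in addition to everything printed at the levels listed below it.
logging.INFO: prints each function that is compiled (original and modified bytecode) and the graph that is captured, in addition to everything printed at the levels listed below it.
logging.WARNING (default): prints graph breaks, in addition to everything printed at the level listed below it.
logging.ERROR: prints errors only.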
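
These are the standard Python `logging` constants, so each level also lets through every more severe level. The sketch below uses only the standard library and does not involve TorchDynamo; `emitted_at` is a helper written for this illustration.

```python
import logging

# The four levels discussed above, from most to least verbose.
levels = [logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR]


def emitted_at(threshold):
    """Names of the levels that a logger set to `threshold` lets through."""
    return [logging.getLevelName(lvl) for lvl in levels if lvl >= threshold]


for lvl in levels:
    print(logging.getLevelName(lvl), "->", emitted_at(lvl))
```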

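If a model is sufficiently large, the logs can become overwhelming. If an error occurs deep within a model's Python code, it can be useful to execute only the frame in which the error occurs, to make debugging easier. Two tools are available for this:

Setting the environment variable `TORCHDYNAMO_DEBUG_FUNCTION` to the desired function name runs TorchDynamo only on functions with that name.
Enabling the record/replay tool (set `torch._dynamo.config.replay_record_enabled = True`) dumps an execution record when an error is encountered. This record can then be replayed to run only the frame where the error occurred.

TorchInductor Errors¶

If the error does not occur with the "eager" backend, the backend compiler is the source of the error (an example error appears later in this section). There are many choices of backend compiler for TorchDynamo, with TorchInductor or nvfuser fitting the needs of most users. This section focuses on TorchInductor as the motivating example, but some of the tools can be used with other backend compilers.

Below is the portion of the stack we are focusing on:

With TorchInductor as the chosen backend, AOTAutograd is used to generate the backward graph from the forward graph captured by TorchDynamo. It is important to note that errors can occur during this tracing and also while TorchInductor lowers the forward and backward graphs to GPU code or C++. A model often consists of hundreds or thousands of FX nodes, so pinpointing the exact nodes where the problem occurs can be very difficult. Fortunately, there are tools that automatically minify these input graphs down to the nodes causing the issue. The first step is to determine whether the error occurs while tracing the backward graph with AOTAutograd or during TorchInductor lowering. As mentioned in step 2 above, the "aot_eager" backend can be used to run AOTAutograd in isolation, without lowering. If the error still occurs with this backend, it is occurring during AOTAutograd tracing.

Here is an example: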

import torch

import torch._dynamo as dynamo

model = torch.nn.Sequential(*[torch.nn.Linear(200, 200) for _ in range(5)])

def test_backend_error():

    y = torch.ones(200, 200)
    x = torch.ones(200, 200)
    z = x + y
    a = torch.ops.aten._foobar(z)  # dummy function which errors
    return model(a)


compiled_test_backend_error = torch.compile(test_backend_error, backend="inductor")
compiled_test_backend_error()

Running this should give you the following error, with a longer stack trace below it:

Traceback (most recent call last):
  File "/scratch/mlazos/torchdynamo/torchinductor/graph.py", line 246, in call_function
    return lowerings[target](*args, **kwargs)
  File "/scratch/mlazos/torchdynamo/torchinductor/lowering.py", line 185, in wrapped
    return decomp_fn(*args, **kwargs)
  File "/scratch/mlazos/torchdynamo/torchinductor/lowering.py", line 810, in _foobar
    assert False
AssertionError
...

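Error with full stack trace

If you then change `torch.compile(backend="inductor")` to `torch.compile(backend="aot_eager")`, the program runs without error, because the issue is in the TorchInductor lowering process and not in AOTAutograd.

Minifying TorchInductor Errors¶

From here, let's run the minifier to get a minimal repro. Setting the environment variable `TORCHDYNAMO_REPRO_AFTER="aot"` (or setting `torch._dynamo.config.repro_after="aot"` directly) generates a Python program that reduces the graph produced by AOTAutograd to the smallest subgraph that reproduces the error. (See below for an example in which we minify the graph produced by TorchDynamo.) Running the program with this environment variable should show nearly identical output, with an additional line indicating where `minifier_launcher.py` has been written. The output directory is configurable by setting `torch._dynamo.config.base_dir` to a valid directory name. The final step is to run the minifier and check that it runs successfully. If it does, it generates runnable Python code that reproduces the exact error. For our example, this is the following code: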

import torch
from torch import tensor, device
import torch.fx as fx
from torch._dynamo.testing import rand_strided
from math import inf
from torch.fx.experimental.proxy_tensor import make_fx

# torch version: 1.13.0a0+gitfddfc44
# torch cuda version: 11.6
# torch git version: fddfc4488afb207971c54ad4bf58130fdc8a4dc5


# CUDA Info:
# nvcc: NVIDIA (R) Cuda compiler driver
# Copyright (c) 2005-2022 NVIDIA Corporation
# Built on Thu_Feb_10_18:23:41_PST_2022
# Cuda compilation tools, release 11.6, V11.6.112
# Build cuda_11.6.r11.6/compiler.30978841_0

# GPU Hardware Info:
# NVIDIA A100-SXM4-40GB : 8

from torch.nn import *

class Repro(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, add):
        _foobar = torch.ops.aten._foobar.default(add);  add = None
        return (_foobar,)

args = [((200, 200), (200, 1), torch.float32, 'cpu')]
args = [rand_strided(shape, stride, dtype, device) for shape, stride, dtype, device in args]
mod = make_fx(Repro())(*args)
from torch._inductor.compile_fx import compile_fx_inner

compiled = compile_fx_inner(mod, args)
compiled(*args)

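The `forward` method of the `Repro` module contains the exact op that causes the issue. When filing an issue, please include any minified repros to aid in debugging.

Minifying Backend Compiler Errors¶

With backend compilers other than TorchInductor, the process for finding the subgraph that causes the error is nearly identical to the procedure described under TorchInductor Errors, with one important caveat: the minifier now runs on the graph traced by TorchDynamo, not on the output graph of AOTAutograd. Let's walk through an example.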

import torch

import torch._dynamo as dynamo

model = torch.nn.Sequential(*[torch.nn.Linear(200, 200) for _ in range(5)])
# toy compiler which fails if graph contains relu
def toy_compiler(gm: torch.fx.GraphModule, _):
    for node in gm.graph.nodes:
        if node.target == torch.relu:
            assert False

    return gm


def test_backend_error():
    y = torch.ones(200, 200)
    x = torch.ones(200, 200)
    z = x + y
    a = torch.relu(z)
    return model(a)


compiled_test_backend_error = torch.compile(test_backend_error, backend=toy_compiler)
compiled_test_backend_error()

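To run the minifier after TorchDynamo has traced the forward graph, you can use the `TORCHDYNAMO_REPRO_AFTER` environment variable. Running this program with `TORCHDYNAMO_REPRO_AFTER="dynamo"` (or `torch._dynamo.config.repro_after="dynamo"`) should produce output from the minifier along with the following code in `{torch._dynamo.config.base_dir}/repro.py`.

Note

The other option for `TORCHDYNAMO_REPRO_AFTER` is "aot", which runs the minifier after the backward graph has been generated.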

import torch
import torch._dynamo as dynamo
from torch import tensor, device
import torch.fx as fx
from torch._dynamo.testing import rand_strided
from math import inf
from torch._dynamo.debug_utils import run_fwd_maybe_bwd

from torch.nn import *

class Repro(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, add):
        relu = torch.relu(add);  add = None
        return (relu,)


mod = Repro().cuda()
opt_mod = torch.compile(mod, backend="None")


args = [((200, 200), (200, 1), torch.float32, 'cpu', False)]
args = [rand_strided(sh, st, dt, dev).requires_grad_(rg) for (sh, st, dt, dev, rg) in args]


with torch.cuda.amp.autocast(enabled=False):
    ref = run_fwd_maybe_bwd(mod, args)
    res = run_fwd_maybe_bwd(opt_mod, args)

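The minifier successfully reduced the graph to the op that raises the error in `toy_compiler`. The other difference from the procedure under TorchInductor Errors is that the minifier runs automatically after a backend compiler error is encountered. After a successful run, the minifier writes `repro.py` to `torch._dynamo.config.base_dir`.

Performance Profiling¶

Accessing the TorchDynamo Profiler¶

TorchDynamo has a built-in stats function for collecting and displaying the time spent in each compilation phase. These stats can be accessed by calling `torch._dynamo.utils.compile_times()` after executing code compiled with TorchDynamo. By default, this returns a string representation of the compile time spent in each TorchDynamo function, by name.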
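
As a minimal sketch of how the stats might be read, the snippet below compiles a small function with the "eager" backend, so that no code generation toolchain is needed, and then prints the report. The function `scale_and_shift` is a placeholder chosen for this illustration, and the contents of the report depend on the installed PyTorch version.

```python
import torch
import torch._dynamo.utils


def scale_and_shift(x):
    # Placeholder workload for the illustration.
    return x * 2 + 1


compiled = torch.compile(scale_and_shift, backend="eager")
compiled(torch.ones(4))  # the first call triggers compilation

# With default arguments, compile_times() returns a string report.
report = torch._dynamo.utils.compile_times()
print(report)
```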

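TorchInductor Debug Tracing¶

TorchInductor has built-in stats and trace functionality for displaying the time spent in each compilation phase, the output code, an output graph visualization, and an IR dump. This is a debugging tool designed to make it easier to understand and troubleshoot the internals of TorchInductor.

Setting the environment variable `TORCH_COMPILE_DEBUG=1` causes a debug trace directory to be created and its location to be printed: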

$ env TORCH_COMPILE_DEBUG=1 python repro.py
torch._inductor.debug: [WARNING] model_forward_0 debug trace: /tmp/torchinductor_jansel/rh/crhwqgmbqtchqt3v3wdeeszjb352m4vbjbvdovaaeqpzi7tdjxqr.debug

Here is an example of the debug directory output for this test program:

torch.nn.Sequential(
        torch.nn.Linear(10, 10),
        torch.nn.LayerNorm(10),
        torch.nn.ReLU(),
    )

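Each file in that debug trace can be enabled and disabled through `torch._inductor.config.trace.*`. The profile and the diagram are both disabled by default, since they are expensive to generate.

A single node in this new debug format looks like this: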

buf1: SchedulerNode(ComputedBuffer)
buf1.writes =
    {   MemoryDep(name='buf1', index=0, size=()),
        MemoryDep(name='buf1', index=0, size=(s0,))}
buf1.unmet_dependencies = {MemoryDep(name='buf0', index=c0, size=(s0,))}
buf1.met_dependencies = {MemoryDep(name='primals_2', index=c0, size=(s0,))}
buf1.group.device = cuda:0
buf1.group.iteration = (1, s0)
buf1.sizes = ([], [s0])
class buf1_loop_body:
    var_ranges = {z0: s0}
    index0 = z0
    index1 = 0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf0', get_index, False)
        get_index_1 = self.get_index('index0')
        load_1 = ops.load('primals_2', get_index_1, False)
        add = ops.add(load, load_1)
        get_index_2 = self.get_index('index1')
        reduction = ops.reduction('buf1', torch.float32, torch.float32, 'sum', get_index_2, add)
        return reduction

See the example debug directory output for more examples.

Graph Breaks¶

Given a program like this:

def some_fun(x):
    ...

compiled_fun = torch.compile(some_fun, ...)
...

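TorchDynamo will attempt to compile all of the torch/tensor operations within `some_fun` into a single FX graph, but it may fail to capture everything in one graph.

Some graph break reasons are insurmountable for TorchDynamo and cannot be easily fixed. For example, calling into a C extension other than torch is invisible to TorchDynamo, and the extension could do arbitrary things without TorchDynamo being able to introduce the guards needed to ensure that the compiled program is safe to reuse. Graph breaks can hinder performance if the resulting fragments are small. To maximize performance, it is important to have as few graph breaks as possible.

Identifying the Cause of a Graph Break¶

To identify all graph breaks in a program and the reasons for them, you can use `torch._dynamo.explain`. This tool runs TorchDynamo on the supplied function and aggregates the graph breaks that are encountered. Here is an example usage: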

import torch
import torch._dynamo as dynamo
def toy_example(a, b):
    x = a / (torch.abs(a) + 1)
    print("woo")
    if b.sum() < 0:
        b = b * -1
    return x * b
explanation, out_guards, graphs, ops_per_graph = dynamo.explain(toy_example, torch.randn(10), torch.randn(10))
print(explanation)
"""
Dynamo produced 3 graphs, with 2 graph break and 6 ops.
 Break reasons:
1. call_function BuiltinVariable(print) [ConstantVariable(str)] {}
   File "t2.py", line 16, in toy_example
    print("woo")

2. generic_jump
   File "t2.py", line 17, in toy_example
    if b.sum() < 0:
 """

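Outputs include:

out_guards - a list of lists, where each sublist contains the guards that must pass to ensure the traced graphs are valid.
graphs - a list of the graph modules that were successfully traced.
ops_per_graph - a list of lists, where each sublist contains the ops that are run in the graph.

To raise an error on the first graph break encountered, use the nopython mode, which `torch.compile` exposes as `fullgraph=True`. This mode disables TorchDynamo's Python fallback and only succeeds if the entire program can be converted into a single graph. Example usage: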

def toy_example(a, b):
   ...

compiled_toy = torch.compile(toy_example, fullgraph=True, backend=<compiler>)
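
To make the difference concrete, here is a hedged sketch that runs a function containing a data-dependent branch in both modes. The "eager" backend is chosen only so that no code generation toolchain is required. The exact exception type and message vary across PyTorch versions, so the sketch catches `Exception` and prints only the type name.

```python
import torch
import torch._dynamo


def toy_example(a, b):
    x = a / (torch.abs(a) + 1)
    if b.sum() < 0:  # branching on tensor data causes a graph break
        b = b * -1
    return x * b


a, b = torch.randn(10), torch.randn(10)

# Default mode: Dynamo falls back to Python at the break and the call succeeds.
relaxed = torch.compile(toy_example, backend="eager")
out = relaxed(a, b)

torch._dynamo.reset()  # clear cached compilations before trying the strict mode

# fullgraph=True: the same break is reported as an error instead.
strict = torch.compile(toy_example, fullgraph=True, backend="eager")
try:
    strict(a, b)
    raised = False
except Exception as exc:
    raised = True
    print("graph break reported as:", type(exc).__name__)
```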

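Excessive Recompilation¶

When TorchDynamo compiles a function (or part of one), it makes certain assumptions about locals and globals in order to allow compiler optimizations, and expresses these assumptions as guards that check particular values at runtime. If any of these guards fails, Dynamo recompiles that function (or part) up to `torch._dynamo.config.cache_size_limit` times. If your program is hitting the cache limit, you first need to determine which guard is failing and which part of your program is triggering it.

The compile profiler automates the process of setting TorchDynamo's cache limit to 1 and running your program under an observation-only "compiler" that records the causes of any guard failures. Be sure to run your program for at least as long (as many iterations) as you were running it when you ran into trouble; the profiler accumulates statistics over this duration.

If your program exhibits a bounded amount of dynamism, you may be able to tune the TorchDynamo cache limit so that each variation is compiled and cached. If the cache limit is too high, however, you may find that the cost of recompilation outweighs any optimization benefit.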

torch._dynamo.config.cache_size_limit = <your desired cache limit>

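TorchDynamo plans to support many common cases of dynamic tensor shapes, such as varying batch size or sequence length. It does not plan to support rank-dynamism. In the meantime, setting a specific cache limit can be used together with bucketing techniques to achieve an acceptable number of recompilations for some dynamic models.

The compile profiler described above can be used as follows: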

import torch
from torch._dynamo.utils import CompileProfiler

prof = CompileProfiler()

def my_model():
    ...

profiler_model = torch.compile(my_model, backend=prof)
profiler_model()
print(prof.report())
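
The bucketing techniques mentioned earlier in this section amount to padding inputs up to a small set of sizes, so that the compiled function only ever sees a bounded number of distinct shapes. The sketch below shows one possible policy in plain Python. The power-of-two rule, the `min_bucket` default, and the name `bucket_length` are assumptions made for this illustration; the text above does not prescribe a particular scheme.

```python
def bucket_length(n, min_bucket=16):
    """Round n up to the next power-of-two bucket, starting at `min_bucket`."""
    if n < 1:
        raise ValueError("length must be positive")
    bucket = min_bucket
    while bucket < n:
        bucket *= 2
    return bucket


lengths = [3, 17, 30, 100, 128, 129]
buckets = [bucket_length(n) for n in lengths]
print(buckets)            # [16, 32, 32, 128, 128, 256]
print(len(set(buckets)))  # 4 distinct shapes instead of 6
```

Padding each input to its bucket size before calling the compiled function keeps the number of shape variants, and therefore the cache limit required, small and predictable.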

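Accuracy Debugging¶

Accuracy issues can also be minified if you set the environment variable `TORCHDYNAMO_REPRO_LEVEL=4`. The minifier then operates on a model similar to git bisect, and a full repro command might look like `TORCHDYNAMO_REPRO_AFTER="aot" TORCHDYNAMO_REPRO_LEVEL=4`. We need this because downstream compilers generate code, whether Triton code or the C++ backend, and the numerics from those downstream compilers can differ in subtle ways that nevertheless have a dramatic impact on training stability. The accuracy debugger is therefore very useful for detecting bugs in our codegen or in a backend compiler.

File an Issue¶

If you experience problems with TorchDynamo, please file a GitHub issue.

Before filing an issue, read over the README and TROUBLESHOOTING, and search for similar issues.

When filing an issue, please include information about your operating system and your Python, PyTorch, CUDA, and Triton versions, which you can collect by running: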

python tools/verify_install.py

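A minimal repro script if possible, which can be generated by running the minifier
A description of the error
The expected behavior
A log (set `torch._dynamo.config.log_file` to a valid file name to dump the logs to a file, together with `torch._dynamo.config.log_level = logging.DEBUG` and `torch._dynamo.config.verbose = True`)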