Custom Backends

Overview

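torch.compile provides a straightforward way for users to define custom backends.

A backend function has the contract (gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]) -> Callable.

TorchDynamo, the graph tracing component of torch.compile, calls the backend function after tracing an FX graph and expects it to return a compiled function that is equivalent to the traced FX graph. The returned callable should have the same contract as the forward function of the original torch.fx.GraphModule passed into the backend: (*args: torch.Tensor) -> List[torch.Tensor].

For TorchDynamo to call your backend, pass your backend function as the backend kwarg to torch.compile. For example: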

import torch

def my_custom_backend(gm, example_inputs):
    return gm.forward

def f(...):
    ...

f_opt = torch.compile(f, backend=my_custom_backend)

@torch.compile(backend=my_custom_backend)
def g(...):
    ...

See below for more examples.
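As a runnable counterpart to the placeholder snippet above, here is a minimal sketch. The function f, its body, and the calls list are hypothetical names chosen for illustration; the backend returns gm.forward unchanged, so it performs no optimization and only demonstrates the contract.

```python
import torch

calls = []  # hypothetical bookkeeping: one entry per graph handed to the backend

def my_custom_backend(gm, example_inputs):
    # Record how many example inputs came with this graph, then return
    # the unoptimized forward, which already satisfies the contract.
    calls.append(len(example_inputs))
    return gm.forward

def f(x, y):
    # Illustrative function body (not from the original document).
    return torch.sin(x) + torch.cos(y)

f_opt = torch.compile(f, backend=my_custom_backend)

x, y = torch.randn(4), torch.randn(4)
result = f_opt(x, y)    # first call triggers tracing and the backend call
expected = f(x, y)      # eager reference
print(torch.allclose(result, expected), len(calls))
```

Comparing the compiled result against the eager result is a cheap sanity check worth keeping while developing a real backend.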

Registering Custom Backends

You can register your backend using the register_backend decorator, for example:

from torch._dynamo import register_backend

@register_backend
def my_compiler(gm, example_inputs):
    ...
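A short sketch of the full round trip may help: register a backend, then refer to it by name. The backend name my_sketch_backend and the function double are hypothetical, and the import path assumes a recent PyTorch release in which register_backend and list_backends are exposed from torch._dynamo.

```python
import torch
from torch._dynamo import register_backend, list_backends

@register_backend
def my_sketch_backend(gm, example_inputs):  # hypothetical backend name
    # No optimization: hand back the traced graph's own forward.
    return gm.forward

def double(x):
    return x * 2

# Once registered, the backend can be selected by its function name.
double_opt = torch.compile(double, backend="my_sketch_backend")
out = double_opt(torch.ones(3))

print("my_sketch_backend" in list_backends())
print(out)
```

Note that registering the same name twice in one process is expected to fail, so a module that registers a backend should do so once at import time.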

Besides the register_backend decorator, if your backend lives in another Python package, you can also register it through the package's entry points, a mechanism that lets one package register a plugin for another.

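Hint

You can learn more about entry_points in the Python packaging documentation.

To register your backend through entry_points, add your backend function to the torch_dynamo_backends entry point group in your package's setup.py file, for example: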

...
setup(
    ...
    entry_points={
        'torch_dynamo_backends': [
            'my_compiler = your_module.submodule:my_compiler',
        ],
    },
    ...
)

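Replace the my_compiler before the = with the name of your backend, and replace the part after the = with the module and function name of your backend function. The entry point is added to your Python environment when the package is installed. When you call torch.compile(model, backend="my_compiler"), PyTorch first searches for a backend named my_compiler that was registered with register_backend. If none is found, it continues searching among all backends registered via entry_points.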
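To check which backends an installed package actually advertises, the entry point group can be inspected with the standard library alone. This is a sketch of such a check, not PyTorch's own lookup code; the helper name discover_dynamo_backends is hypothetical.

```python
from importlib.metadata import entry_points

def discover_dynamo_backends(group="torch_dynamo_backends"):
    """Return {entry point name: 'module:function'} for the given group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ offers a selection API.
        selected = eps.select(group=group)
    else:
        # Older versions return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}

found = discover_dynamo_backends()
print(found)  # empty dict when no installed package registers a backend
```

If a backend you installed does not appear here, the package metadata is the first place to look, since the entry point is only written at install time.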

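Registration serves two purposes:

You can pass a string containing your backend function's name to torch.compile instead of the function itself, for example, torch.compile(model, backend="my_compiler").
It is required for use with the minifier. Any code generated by the minifier must call the code that registers your backend function, typically through an import statement.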

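Custom Backends after AOTAutograd

It is possible to define custom backends that are called by AOTAutograd rather than TorchDynamo. This is useful for two main reasons:

Users can define backends that support model training, because AOTAutograd can generate the backward graph for compilation.
AOTAutograd produces FX graphs consisting of canonical Aten ops. As a result, custom backends only need to support the canonical Aten opset, which is significantly smaller than the entire torch/Aten opset.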

Wrap your backend with aot_autograd, imported as shown in the example below, and use torch.compile with the backend kwarg as before. Backend functions wrapped by aot_autograd should have the same contract as before.

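Backend functions are passed to aot_autograd through the fw_compiler (forward compiler) or bw_compiler (backward compiler) kwargs. If bw_compiler is not specified, the backward compile function defaults to the forward compile function.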

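One caveat is that AOTAutograd requires the compiled functions returned by backends to be "boxed". This can be done by wrapping the compiled function with functorch.compile.make_boxed_func.

For example: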
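To make the term concrete: a boxed function receives all of its arguments in a single list rather than as separate positional arguments. The pure-Python sketch below illustrates that calling convention only. It is an assumption-laden stand-in, not the library's implementation, and the names make_boxed_sketch and add_mul are hypothetical.

```python
def make_boxed_sketch(fn):
    """Illustrative stand-in for the idea behind make_boxed_func."""
    def boxed(args):
        # The caller passes one list; the wrapper unpacks it for fn.
        return fn(*args)
    # Marker attribute flagging the boxed convention (assumed for illustration).
    boxed._boxed_call = True
    return boxed

def add_mul(a, b, c):
    return [a + b * c]

boxed_add_mul = make_boxed_sketch(add_mul)
result = boxed_add_mul([1, 2, 3])  # one list argument instead of three positionals
print(result)  # [7]
```

Passing a single list lets the callee drop references to inputs it no longer needs, which is the usual motivation given for this convention.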

from torch._dynamo.backends.common import aot_autograd
from functorch.compile import make_boxed_func

def my_compiler(gm, example_inputs):
    return make_boxed_func(gm.forward)

my_backend = aot_autograd(fw_compiler=my_compiler)  # bw_compiler=my_compiler

model_opt = torch.compile(model, backend=my_backend)

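Examples

Debugging Backend

If you want to better understand what is going on during compilation, you can create a custom compiler, referred to as a backend in this section, that pretty-prints the FX GraphModule extracted from Dynamo's bytecode analysis and returns its forward() callable.

For example: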

from typing import List
import torch

def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
    print("my_compiler() called with FX graph:")
    gm.graph.print_tabular()
    return gm.forward  # return a python callable

@torch.compile(backend=my_compiler)
def fn(x, y):
    a = torch.cos(x)
    b = torch.sin(y)
    return a + b

fn(torch.randn(10), torch.randn(10))

Running the above example produces the following output:

my_compiler() called with FX graph:
opcode         name    target                                                  args        kwargs
-------------  ------  ------------------------------------------------------  ----------  --------
placeholder    x       x                                                       ()          {}
placeholder    y       y                                                       ()          {}
call_function  cos     <built-in method cos of type object at 0x7f1a894649a8>  (x,)        {}
call_function  sin     <built-in method sin of type object at 0x7f1a894649a8>  (y,)        {}
call_function  add     <built-in function add>                                 (cos, sin)  {}
output         output  output                                                  ((add,),)   {}

This works for torch.nn.Module as well, as shown below:

from typing import List
import torch

def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
    print("my_compiler() called with FX graph:")
    gm.graph.print_tabular()
    return gm.forward  # return a python callable

class MockModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        return self.relu(torch.cos(x))

mod = MockModule()
optimized_mod = torch.compile(mod, backend=my_compiler)
optimized_mod(torch.randn(10))
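Beyond printing, a debugging backend can also inspect the graph programmatically by iterating over gm.graph.nodes. The sketch below tallies node opcodes instead of printing a table; counting_backend and op_counts are hypothetical names, and the exact counts may vary between PyTorch versions.

```python
from collections import Counter
import torch

op_counts = Counter()  # hypothetical tally of FX node opcodes

def counting_backend(gm, example_inputs):
    # Each FX node has an `op` string such as "placeholder",
    # "call_function", "call_method", or "output".
    for node in gm.graph.nodes:
        op_counts[node.op] += 1
    return gm.forward

@torch.compile(backend=counting_backend)
def fn(x, y):
    return torch.cos(x) + torch.sin(y)

fn(torch.randn(10), torch.randn(10))
print(dict(op_counts))
```

A tally like this is a quick way to estimate how many distinct operations a real backend would have to support for a given model.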

Let's take a look at one more example with control flow:

from typing import List
import torch

def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
    print("my_compiler() called with FX graph:")
    gm.graph.print_tabular()
    return gm.forward  # return a python callable

@torch.compile(backend=my_compiler)
def toy_example(a, b):
    x = a / (torch.abs(a) + 1)
    if b.sum() < 0:
        b = b * -1
    return x * b

for _ in range(100):
    toy_example(torch.randn(10), torch.randn(10))

Running this example produces the following output:

my_compiler() called with FX graph:
opcode         name     target                                                  args              kwargs
-------------  -------  ------------------------------------------------------  ----------------  --------
placeholder    a        a                                                       ()                {}
placeholder    b        b                                                       ()                {}
call_function  abs_1    <built-in method abs of type object at 0x7f8d259298a0>  (a,)              {}
call_function  add      <built-in function add>                                 (abs_1, 1)        {}
call_function  truediv  <built-in function truediv>                             (a, add)          {}
call_method    sum_1    sum                                                     (b,)              {}
call_function  lt       <built-in function lt>                                  (sum_1, 0)        {}
output         output   output                                                  ((truediv, lt),)  {}

my_compiler() called with FX graph:
opcode         name    target                   args         kwargs
-------------  ------  -----------------------  -----------  --------
placeholder    b       b                        ()           {}
placeholder    x       x                        ()           {}
call_function  mul     <built-in function mul>  (b, -1)      {}
call_function  mul_1   <built-in function mul>  (x, mul)     {}
output         output  output                   ((mul_1,),)  {}

my_compiler() called with FX graph:
opcode         name    target                   args       kwargs
-------------  ------  -----------------------  ---------  --------
placeholder    b       b                        ()         {}
placeholder    x       x                        ()         {}
call_function  mul     <built-in function mul>  (x, b)     {}
output         output  output                   ((mul,),)  {}

The order of the last two graphs is nondeterministic, depending on which one is encountered first by the just-in-time compiler.
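Because random inputs decide which branch is traced first, a deterministic variant can be easier to study. The sketch below, using the hypothetical names recording_backend, graphs_seen, and toy_eager, feeds one input for each branch and compares the compiled results with eager execution; the number of captured graphs is printed rather than assumed.

```python
import torch

graphs_seen = []  # hypothetical list of every graph handed to the backend

def recording_backend(gm, example_inputs):
    graphs_seen.append(gm)
    return gm.forward

def toy_eager(a, b):
    x = a / (torch.abs(a) + 1)
    if b.sum() < 0:  # data-dependent branch: forces a graph break
        b = b * -1
    return x * b

toy_example = torch.compile(toy_eager, backend=recording_backend)

a = torch.randn(10)
pos = toy_example(a, torch.ones(10))    # b.sum() >= 0: branch body skipped
neg = toy_example(a, -torch.ones(10))   # b.sum() < 0: branch body taken

print(len(graphs_seen))
print(torch.allclose(pos, toy_eager(a, torch.ones(10))))
print(torch.allclose(neg, toy_eager(a, -torch.ones(10))))
```

Driving each branch explicitly makes the order of the captured graphs reproducible, which helps when comparing backend output across runs.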

Speedy Backend

Integrating a custom backend that offers superior performance is also easy. Here we integrate a real one using optimize_for_inference:

from typing import List
import torch

def optimize_for_inference_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
    scripted = torch.jit.script(gm)
    return torch.jit.optimize_for_inference(scripted)

You should then be able to optimize any existing code with:

@torch.compile(backend=optimize_for_inference_compiler)
def code_to_accelerate():
    ...

Composable Backends

TorchDynamo includes many backends, which can be found in backends.py or listed with torch._dynamo.list_backends(). You can combine these backends with the following code:

from typing import List
import torch
from torch._dynamo import lookup_backend

def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
    try:
        trt_compiled = lookup_backend("tensorrt")(gm, example_inputs)
        if trt_compiled is not None:
            return trt_compiled
    except Exception:
        pass
    # first backend failed, try something else...
    try:
        inductor_compiled = lookup_backend("inductor")(gm, example_inputs)
        if inductor_compiled is not None:
            return inductor_compiled
    except Exception:
        pass
    # every backend failed: fall back to the unoptimized graph
    return gm.forward