自定义 C++ 和 CUDA 运算符¶

创建时间： Jun 18， 2024 |上次更新时间：2024 年 12 月 30 日 |上次验证： Nov 05， 2024

您将学到什么

如何将用 C++/CUDA 编写的自定义运算符与 PyTorch 集成
如何使用torch.library.opcheck

先决条件

PyTorch 2.4 或更高版本
对 C++ 和 CUDA 编程有基本的了解

PyTorch 提供了一个大型的运算符库，这些运算符适用于张量（例如 torch.add、torch.sum 等）。但是，您可能希望为 PyTorch 引入新的自定义运算符。本教程演示了编写用 C++/CUDA 编写的自定义运算符的福道。

在我们的教程中，我们将演示如何编写融合的乘加 C++ 以及由 PyTorch 子系统组成的 CUDA 运算符。的语义作如下：

def mymuladd(a: Tensor, b: Tensor, c: float):
    return a * b + c

您可以在此处找到本教程的端到端工作示例。

设置 Build System¶

如果您正在开发自定义 C++/CUDA 代码，则必须对其进行编译。请注意，如果要与已经具有 bindings 的 Python 库交互要预编译 C++/CUDA 代码，您可以考虑编写自定义 Python 运算符而是（自定义 Python 运算符）。

使用 torch.utils.cpp_extension 编译自定义 C++/CUDA 代码以用于 PyTorch C++ 扩展可以使用 setuptools “提前”构建，也可以“即时”构建通过 load_inline; 我们将专注于 “Ahead of Time” 的风味。

使用就像编写以下内容一样简单：cpp_extensionsetup.py

from setuptools import setup, Extension
from torch.utils import cpp_extension

setup(name="extension_cpp",
      ext_modules=[
          cpp_extension.CppExtension("extension_cpp", ["muladd.cpp"])],
      cmdclass={'build_ext': cpp_extension.BuildExtension})

如果您需要编译 CUDA 代码（例如，文件），请改用 torch.utils.cpp_extension。CUDAExtension. 请参阅 extension-cpp 以获取示例，了解如何设置。.cu

从 PyTorch 2.6 开始，您现在可以为多个 CPython 构建单个 wheel 版本（类似于对纯 Python 包执行的作）。特别如果您的自定义库遵循 CPython Stable Limited API 或完全避免使用 CPython，则您可以针对支持的最低 CPython 版本构建一个与 Python 无关的轮子通过 setupTools 的标志，如下所示：py_limited_api

from setuptools import setup, Extension
from torch.utils import cpp_extension

setup(name="extension_cpp",
      ext_modules=[
          cpp_extension.CppExtension(
            "extension_cpp",
            ["python_agnostic_code.cpp"],
            py_limited_api=True)],
      cmdclass={'build_ext': cpp_extension.BuildExtension},
      options={"bdist_wheel": {"py_limited_api": "cp39"}}
)

请注意，您必须在命令中指定和作为选项，并支持最小 Python 版本（在本例中为 3.9）。这将构建一个可以安装在多个 Python 版本中。请参阅 torchao 的示例。py_limited_api=Truesetup"bdist_wheel"setuppython>=3.9

注意

您必须独立验证构建的 wheel 是否确实与 Python 无关。指定不会检查任何保证，因此是可能的构建一个看起来与 Python 无关但会崩溃的轮子，或者更糟糕的是，以静默方式错误，在另一个 Python 环境中。注意避免使用不稳定的 CPython API，例如来自 libtorch_python 的 API（特别是 pytorch/python 绑定）并且仅使用 libtorch 中的 API（aten 对象、运算符和调度程序）。例如，要从 Python 访问自定义运算，该库应注册通过 Dispatcher 进行 OPS（如下所述！py_limited_api

定义自定义运算并添加后端实现¶

首先，让我们编写一个 C++ 函数来计算：mymuladd

at::Tensor mymuladd_cpu(at::Tensor a, const at::Tensor& b, double c) {
  TORCH_CHECK(a.sizes() == b.sizes());
  TORCH_CHECK(a.dtype() == at::kFloat);
  TORCH_CHECK(b.dtype() == at::kFloat);
  TORCH_INTERNAL_ASSERT(a.device().type() == at::DeviceType::CPU);
  TORCH_INTERNAL_ASSERT(b.device().type() == at::DeviceType::CPU);
  at::Tensor a_contig = a.contiguous();
  at::Tensor b_contig = b.contiguous();
  at::Tensor result = torch::empty(a_contig.sizes(), a_contig.options());
  const float* a_ptr = a_contig.data_ptr<float>();
  const float* b_ptr = b_contig.data_ptr<float>();
  float* result_ptr = result.data_ptr<float>();
  for (int64_t i = 0; i < result.numel(); i++) {
    result_ptr[i] = a_ptr[i] * b_ptr[i] + c;
  }
  return result;
}

为了从 PyTorch 的 Python 前端使用它，我们需要注册它作为使用 API 的 PyTorch作员。这将自动将 operator 绑定到 Python。TORCH_LIBRARY

作员注册分为两个步骤：

定义运算符 - 此步骤可确保 PyTorch 知道新运算符。
注册后端实现 - 在此步骤中，各种后端（如 CPU 和 CUDA）与运算符相关联。

定义运算符¶

要定义运算符，请执行以下步骤：

为 Operator 选择命名空间。我们建议 namespace 是您的顶级项目;我们将在本教程中使用 “extension_cpp”。
提供指定运算符的输入/输出类型的架构字符串，如果 input Tensors 将被改变。除了 Tensor 和 float 之外，我们还支持更多类型; 有关更多详细信息，请参阅自定义作手册。
- 如果您正在编写可以改变其输入 Tensor 的运算符，请参阅此处（创建可变运算符）了解如何指定该作。

TORCH_LIBRARY(extension_cpp, m) {
   // Note that "float" in the schema corresponds to the C++ double type
   // and the Python float type.
   m.def("mymuladd(Tensor a, Tensor b, float c) -> Tensor");
 }

这使得运算符可通过 Python 使用。torch.ops.extension_cpp.mymuladd

为 Operator 注册后端实现¶

用于为 Operator 注册后端实现。TORCH_LIBRARY_IMPL

TORCH_LIBRARY_IMPL(extension_cpp, CPU, m) {
  m.impl("mymuladd", &mymuladd_cpu);
}

如果您还有的 CUDA 实现，则可以注册它在单独的块中：myaddmulTORCH_LIBRARY_IMPL

__global__ void muladd_kernel(int numel, const float* a, const float* b, float c, float* result) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < numel) result[idx] = a[idx] * b[idx] + c;
}

at::Tensor mymuladd_cuda(const at::Tensor& a, const at::Tensor& b, double c) {
  TORCH_CHECK(a.sizes() == b.sizes());
  TORCH_CHECK(a.dtype() == at::kFloat);
  TORCH_CHECK(b.dtype() == at::kFloat);
  TORCH_INTERNAL_ASSERT(a.device().type() == at::DeviceType::CUDA);
  TORCH_INTERNAL_ASSERT(b.device().type() == at::DeviceType::CUDA);
  at::Tensor a_contig = a.contiguous();
  at::Tensor b_contig = b.contiguous();
  at::Tensor result = torch::empty(a_contig.sizes(), a_contig.options());
  const float* a_ptr = a_contig.data_ptr<float>();
  const float* b_ptr = b_contig.data_ptr<float>();
  float* result_ptr = result.data_ptr<float>();

  int numel = a_contig.numel();
  muladd_kernel<<<(numel+255)/256, 256>>>(numel, a_ptr, b_ptr, c, result_ptr);
  return result;
}

TORCH_LIBRARY_IMPL(extension_cpp, CUDA, m) {
  m.impl("mymuladd", &mymuladd_cuda);
}

添加对运算符的支持`torch.compile`¶

要添加对运算符的支持，我们必须添加一个 FakeTensor 内核（以及称为“元内核”或“抽象 impl”）。FakeTensor 是具有元数据（例如 shape、dtype、device）但没有数据：FakeTensor 内核的 operator 指定如何在给定输入张量元数据的情况下计算输出张量的元数据。 FakeTensor 内核应返回您选择的虚拟 Tensor 正确的 Tensor 元数据（shape/strides//device）。torch.compiledtype

我们建议通过 API 从 Python 完成此作，尽管也可以从 C++ 执行此作（有关更多详细信息，请参阅 The Custom Operators Manual）。torch.library.register_fake

# Important: the C++ custom operator definitions should be loaded first
# before calling ``torch.library`` APIs that add registrations for the
# C++ custom operator(s). The following import loads our
# C++ custom operator definitions.
# Note that if you are striving for Python agnosticism, you should use
# the ``load_library(...)`` API call instead. See the next section for
# more details.
from . import _C

@torch.library.register_fake("extension_cpp::mymuladd")
def _(a, b, c):
    torch._check(a.shape == b.shape)
    torch._check(a.dtype == torch.float)
    torch._check(b.dtype == torch.float)
    torch._check(a.device == b.device)
    return torch.empty_like(a)

设置 Python/C++ 混合注册¶

在本教程中，我们在 C++ 中定义了一个自定义运算符，添加了 CPU/CUDA C++ 中的实现，并添加了内核和向后公式在 Python 中。加载（或导入）这些注册的顺序 Matters （以错误的顺序导入将导致错误）。FakeTensor

要将自定义运算符与混合 Python/C++ 注册一起使用，我们必须首先加载包含自定义运算符定义的 C++ 库然后调用注册 API。这可能发生在一个有两种方式：torch.library

如果遵循本教程，请导入 Python C 扩展模块我们创建的将加载 C++ 自定义运算符定义。
如果您的 C++ 自定义运算符位于共享库对象中，则可以也用于加载它。这是 Python 不可知论的福道，因为您不会有 Python C 扩展模块导入。有关示例，请参阅 torchao __init__.py。torch.ops.load_library("/path/to/library.so")

为作员添加训练（autograd）支持¶

用于为作员添加训练支持。喜欢这直接使用 Python 或 C++ ; 您必须以非常具体的方式使用它们，以避免静默错误（有关更多详细信息，请参阅 The Custom Operators Manual）。torch.library.register_autogradtorch.autograd.Functiontorch::autograd::Function

def _backward(ctx, grad):
    a, b = ctx.saved_tensors
    grad_a, grad_b = None, None
    if ctx.needs_input_grad[0]:
        grad_a = grad * b
    if ctx.needs_input_grad[1]:
        grad_b = grad * a
    return grad_a, grad_b, None

def _setup_context(ctx, inputs, output):
    a, b, c = inputs
    saved_a, saved_b = None, None
    if ctx.needs_input_grad[0]:
        saved_b = b
    if ctx.needs_input_grad[1]:
        saved_a = a
    ctx.save_for_backward(saved_a, saved_b)

# This code adds training support for the operator. You must provide us
# the backward formula for the operator and a `setup_context` function
# to save values to be used in the backward.
torch.library.register_autograd(
    "extension_cpp::mymuladd", _backward, setup_context=_setup_context)

请注意，backward 必须是 PyTorch 理解的运算符的组合。如果您希望在向后传递中使用另一个自定义 C++ 或 CUDA 内核，它必须包装到自定义运算符中。

如果我们有自己的自定义内核，则需要将其包装到 custom 运算符，然后从后向后调用它：mymul

// New! a mymul_cpu kernel
at::Tensor mymul_cpu(const at::Tensor& a, const at::Tensor& b) {
  TORCH_CHECK(a.sizes() == b.sizes());
  TORCH_CHECK(a.dtype() == at::kFloat);
  TORCH_CHECK(b.dtype() == at::kFloat);
  TORCH_CHECK(a.device().type() == at::DeviceType::CPU);
  TORCH_CHECK(b.device().type() == at::DeviceType::CPU);
  at::Tensor a_contig = a.contiguous();
  at::Tensor b_contig = b.contiguous();
  at::Tensor result = torch::empty(a_contig.sizes(), a_contig.options());
  const float* a_ptr = a_contig.data_ptr<float>();
  const float* b_ptr = b_contig.data_ptr<float>();
  float* result_ptr = result.data_ptr<float>();
  for (int64_t i = 0; i < result.numel(); i++) {
    result_ptr[i] = a_ptr[i] * b_ptr[i];
  }
  return result;
}

TORCH_LIBRARY(extension_cpp, m) {
  m.def("mymuladd(Tensor a, Tensor b, float c) -> Tensor");
  // New! defining the mymul operator
  m.def("mymul(Tensor a, Tensor b) -> Tensor");
}


TORCH_LIBRARY_IMPL(extension_cpp, CPU, m) {
  m.impl("mymuladd", &mymuladd_cpu);
  // New! registering the cpu kernel for the mymul operator
  m.impl("mymul", &mymul_cpu);
}

def _backward(ctx, grad):
    a, b = ctx.saved_tensors
    grad_a, grad_b = None, None
    if ctx.needs_input_grad[0]:
        grad_a = torch.ops.extension_cpp.mymul.default(grad, b)
    if ctx.needs_input_grad[1]:
        grad_b = torch.ops.extension_cpp.mymul.default(grad, a)
    return grad_a, grad_b, None


def _setup_context(ctx, inputs, output):
    a, b, c = inputs
    saved_a, saved_b = None, None
    if ctx.needs_input_grad[0]:
        saved_b = b
    if ctx.needs_input_grad[1]:
        saved_a = a
    ctx.save_for_backward(saved_a, saved_b)


# This code adds training support for the operator. You must provide us
# the backward formula for the operator and a `setup_context` function
# to save values to be used in the backward.
torch.library.register_autograd(
    "extension_cpp::mymuladd", _backward, setup_context=_setup_context)

测试 Operator¶

用于测试自定义运算是否已正确注册。请注意，此函数不会测试梯度在数学上是否正确 – 计划为此编写单独的测试，手动测试或使用 .torch.library.opchecktorch.autograd.gradcheck

def sample_inputs(device, *, requires_grad=False):
    def make_tensor(*size):
        return torch.randn(size, device=device, requires_grad=requires_grad)

    def make_nondiff_tensor(*size):
        return torch.randn(size, device=device, requires_grad=False)

    return [
        [make_tensor(3), make_tensor(3), 1],
        [make_tensor(20), make_tensor(20), 3.14],
        [make_tensor(20), make_nondiff_tensor(20), -123],
        [make_nondiff_tensor(2, 3), make_tensor(2, 3), -0.3],
    ]

def reference_muladd(a, b, c):
    return a * b + c

samples = sample_inputs(device, requires_grad=True)
samples.extend(sample_inputs(device, requires_grad=False))
for args in samples:
    # Correctness test
    result = torch.ops.extension_cpp.mymuladd(*args)
    expected = reference_muladd(*args)
    torch.testing.assert_close(result, expected)

    # Use opcheck to check for incorrect usage of operator registration APIs
    torch.library.opcheck(torch.ops.extension_cpp.mymuladd.default, args)

创建可变运算符¶

您可能希望编写一个自定义运算符来改变其输入。用于指定 schema 中的每个可变 Tensor;否则，将出现 undefined 行为。如果有多个突变的 Tensor，请为每个可变 Tensor 使用不同的名称（例如，，，）。Tensor(a!)Tensor(a!)Tensor(b!)Tensor(c!)

让我们编写一个运算符，该运算符将的内容写入。myadd_out(a, b, out)a+bout

// An example of an operator that mutates one of its inputs.
void myadd_out_cpu(const at::Tensor& a, const at::Tensor& b, at::Tensor& out) {
  TORCH_CHECK(a.sizes() == b.sizes());
  TORCH_CHECK(b.sizes() == out.sizes());
  TORCH_CHECK(a.dtype() == at::kFloat);
  TORCH_CHECK(b.dtype() == at::kFloat);
  TORCH_CHECK(out.dtype() == at::kFloat);
  TORCH_CHECK(out.is_contiguous());
  TORCH_INTERNAL_ASSERT(a.device().type() == at::DeviceType::CPU);
  TORCH_INTERNAL_ASSERT(b.device().type() == at::DeviceType::CPU);
  TORCH_INTERNAL_ASSERT(out.device().type() == at::DeviceType::CPU);
  at::Tensor a_contig = a.contiguous();
  at::Tensor b_contig = b.contiguous();
  const float* a_ptr = a_contig.data_ptr<float>();
  const float* b_ptr = b_contig.data_ptr<float>();
  float* result_ptr = out.data_ptr<float>();
  for (int64_t i = 0; i < out.numel(); i++) {
    result_ptr[i] = a_ptr[i] + b_ptr[i];
  }
}

在定义 operator 时，我们必须指定它在 schema 中改变 out Tensor：

TORCH_LIBRARY(extension_cpp, m) {
  m.def("mymuladd(Tensor a, Tensor b, float c) -> Tensor");
  m.def("mymul(Tensor a, Tensor b) -> Tensor");
  // New!
  m.def("myadd_out(Tensor a, Tensor b, Tensor(a!) out) -> ()");
}

TORCH_LIBRARY_IMPL(extension_cpp, CPU, m) {
  m.impl("mymuladd", &mymuladd_cpu);
  m.impl("mymul", &mymul_cpu);
  // New!
  m.impl("myadd_out", &myadd_out_cpu);
}

注意

不要返回任何突变的 Tensor 作为运算符的输出，因为这会导致与 PyTorch 子系统不兼容，例如 .torch.compile

结论¶

在本教程中，我们介绍了集成自定义 C++ 的推荐方法以及 PyTorch 的 CUDA 运算符。API 是公平的低级。有关如何使用 API 的更多信息，请参阅自定义作员手册。TORCH_LIBRARY/torch.library

自定义 C++ 和 CUDA 运算符¶

设置 Build System¶

定义自定义运算并添加后端实现¶

定义运算符¶

为 Operator 注册后端实现¶

添加对运算符的支持`torch.compile`¶

设置 Python/C++ 混合注册¶

为作员添加训练（autograd）支持¶

测试 Operator¶

创建可变运算符¶

结论¶

文档

教程

资源

自定义 C++ 和 CUDA 运算符¶

设置 Build System¶

定义自定义运算并添加后端实现¶

定义运算符¶

为 Operator 注册后端实现¶

添加对运算符的支持torch.compile¶

设置 Python/C++ 混合注册¶

为作员添加训练 （autograd） 支持¶

测试 Operator¶

创建可变运算符¶

结论¶

文档

教程

资源

添加对运算符的支持`torch.compile`¶

为作员添加训练（autograd）支持¶