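CUDA semantics¶

torch.cuda is used to set up and run CUDA operations. It keeps track of the currently selected GPU, and all CUDA tensors you allocate are created on that device by default. The selected device can be changed with a torch.cuda.device context manager.

However, once a tensor is allocated, you can operate on it regardless of the selected device, and the results are always placed on the same device as the tensor.

Cross-GPU operations are not allowed by default, with the exception of copy_() and other methods with copy-like functionality such as to() and cuda(). Unless you enable peer-to-peer memory access, any attempt to launch an op on tensors spread across different devices raises an error.

Below you can find a small example showcasing this: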

cuda = torch.device('cuda')     # Default CUDA device
cuda0 = torch.device('cuda:0')
cuda2 = torch.device('cuda:2')  # GPU 2 (these are 0-indexed)

x = torch.tensor([1., 2.], device=cuda0)
# x.device is device(type='cuda', index=0)
y = torch.tensor([1., 2.]).cuda()
# y.device is device(type='cuda', index=0)

with torch.cuda.device(1):
    # allocates a tensor on GPU 1
    a = torch.tensor([1., 2.], device=cuda)

    # transfers a tensor from CPU to GPU 1
    b = torch.tensor([1., 2.]).cuda()
    # a.device and b.device are device(type='cuda', index=1)

    # You can also use ``Tensor.to`` to transfer a tensor:
    b2 = torch.tensor([1., 2.]).to(device=cuda)
    # b.device and b2.device are device(type='cuda', index=1)

    c = a + b
    # c.device is device(type='cuda', index=1)

    z = x + y
    # z.device is device(type='cuda', index=0)

    # even within a context, you can specify the device
    # (or give a GPU index to the .cuda call)
    d = torch.randn(2, device=cuda2)
    e = torch.randn(2).to(cuda2)
    f = torch.randn(2).cuda(cuda2)
    # d.device, e.device, and f.device are all device(type='cuda', index=2)

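TensorFloat-32 (TF32) on Ampere devices¶

Starting in PyTorch 1.7, there is a new flag called allow_tf32, which defaults to true. This flag controls whether PyTorch is allowed to use the TensorFloat32 (TF32) tensor cores, available on NVIDIA GPUs since Ampere, internally to compute matmuls (matrix multiplies and batched matrix multiplies) and convolutions.

TF32 tensor cores are designed to achieve better performance on matmuls and convolutions on torch.float32 tensors by rounding input data to 10 bits of mantissa and accumulating results with FP32 precision, which maintains the FP32 dynamic range.

Matmuls and convolutions are controlled separately, and their corresponding flags can be accessed at: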

# The flag below controls whether to allow TF32 on matmul. This flag defaults to True.
torch.backends.cuda.matmul.allow_tf32 = True

# The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
torch.backends.cudnn.allow_tf32 = True

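Note that besides matmuls and convolutions themselves, functions and nn modules that internally use matmuls or convolutions are also affected. These include nn.Linear, nn.Conv*, cdist, tensordot, affine grid and grid sample, adaptive log softmax, GRU and LSTM.

To get an idea of the precision and speed, see the example code below: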

a_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
b_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
ab_full = a_full @ b_full
mean = ab_full.abs().mean()  # 80.7277

a = a_full.float()
b = b_full.float()

# Do matmul at TF32 mode.
ab_tf32 = a @ b  # takes 0.016s on GA100
error = (ab_tf32 - ab_full).abs().max()  # 0.1747
relative_error = error / mean  # 0.0022

# Do matmul with TF32 disabled.
torch.backends.cuda.matmul.allow_tf32 = False
ab_fp32 = a @ b  # takes 0.11s on GA100
error = (ab_fp32 - ab_full).abs().max()  # 0.0031
relative_error = error / mean  # 0.000039

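From the above example, we can see that with TF32 enabled, the matmul is about 7x faster, and the relative error compared to double precision is approximately 2 orders of magnitude larger. If full FP32 precision is needed, users can disable TF32 by: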

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
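
When only part of a program needs full FP32 precision, the two flags above can be flipped temporarily and then restored. The following is a minimal sketch under that assumption; tf32_disabled is a hypothetical helper name used for illustration, not part of the PyTorch API.

```python
import contextlib

import torch


@contextlib.contextmanager
def tf32_disabled():
    """Temporarily turn off TF32 for matmuls and cuDNN convolutions.

    Hypothetical helper: it only saves, clears, and restores the two
    flags shown above.
    """
    prev_matmul = torch.backends.cuda.matmul.allow_tf32
    prev_cudnn = torch.backends.cudnn.allow_tf32
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
    try:
        yield
    finally:
        # Restore whatever the caller had configured before.
        torch.backends.cuda.matmul.allow_tf32 = prev_matmul
        torch.backends.cudnn.allow_tf32 = prev_cudnn


before = (torch.backends.cuda.matmul.allow_tf32,
          torch.backends.cudnn.allow_tf32)
with tf32_disabled():
    # Code that needs full FP32 precision would go here.
    inside = (torch.backends.cuda.matmul.allow_tf32,
              torch.backends.cudnn.allow_tf32)
after = (torch.backends.cuda.matmul.allow_tf32,
         torch.backends.cudnn.allow_tf32)
```

The flags are process-wide, so this sketch is not safe if other threads launch matmuls or convolutions while the context is active.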

To toggle the TF32 flags off in C++, you can do:

at::globalContext().setAllowTF32CuBLAS(false);
at::globalContext().setAllowTF32CuDNN(false);

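For more information about TF32, see:

Asynchronous execution¶

By default, GPU operations are asynchronous. When you call a function that uses the GPU, the operations are enqueued to the particular device, but not necessarily executed until later. This allows us to execute more computations in parallel, including operations on the CPU or other GPUs.

In general, the effect of asynchronous computation is invisible to the caller, because (1) each device executes operations in the order they are queued, and (2) PyTorch automatically performs the necessary synchronization when copying data between CPU and GPU or between two GPUs. Hence, computation proceeds as if every operation was executed synchronously.

You can force synchronous computation by setting the environment variable CUDA_LAUNCH_BLOCKING=1. This can be handy when an error occurs on the GPU. (With asynchronous execution, such an error isn't reported until after the operation is actually executed, so the stack trace does not show where it was requested.)

A consequence of asynchronous computation is that time measurements without synchronization are not accurate. To get precise measurements, either call torch.cuda.synchronize() before measuring, or use torch.cuda.Event to record times as follows: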

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()

# Run some things here

end_event.record()
torch.cuda.synchronize()  # Wait for the events to be recorded!
elapsed_time_ms = start_event.elapsed_time(end_event)
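
The other option mentioned above, calling torch.cuda.synchronize() around a wall-clock measurement, can be sketched as follows. The helper name timed is an assumption made for illustration; the sketch falls back to plain CPU timing when no GPU is present.

```python
import time

import torch


def timed(fn):
    """Run fn() and return (result, elapsed seconds).

    Hypothetical helper: synchronizes before starting and before
    stopping the clock so that queued GPU work is included.
    """
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.synchronize()  # drain work queued before the measurement
    start = time.perf_counter()
    result = fn()
    if use_cuda:
        torch.cuda.synchronize()  # wait for the kernels launched by fn()
    return result, time.perf_counter() - start


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
a = torch.ones(256, 256, device=device)
product, seconds = timed(lambda: a @ a)
```

Without the second synchronize(), the clock would typically stop as soon as the kernels were enqueued rather than when they finished.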

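As an exception, several functions such as to() and copy_() accept an explicit non_blocking argument, which lets the caller bypass synchronization when it is unnecessary. Another exception is CUDA streams, explained below.

CUDA streams¶

A CUDA stream is a linear sequence of execution that belongs to a specific device. You normally do not need to create one explicitly: by default, each device uses its own "default" stream.

Operations inside each stream are serialized in the order they are created, but operations from different streams can execute concurrently in any relative order, unless explicit synchronization functions (such as synchronize() or wait_stream()) are used. For example, the following code is incorrect: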

cuda = torch.device('cuda')
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
with torch.cuda.stream(s):
    # sum() may start execution before normal_() finishes!
    B = torch.sum(A)

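When the "current stream" is the default stream, PyTorch automatically performs the necessary synchronization when data is moved around, as explained above. However, when using non-default streams, it is the user's responsibility to ensure proper synchronization.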
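
One way to repair the incorrect snippet above is to make the side stream wait for the work already queued on the current stream, and to make the current stream wait for the side stream before the result is used. This is a sketch rather than the only correct pattern; sum_on_side_stream is a hypothetical helper, and it simply sums on the CPU when the input is not a CUDA tensor.

```python
import torch


def sum_on_side_stream(A):
    """Sum A on a side stream with explicit synchronization (illustrative)."""
    if not A.is_cuda:
        return torch.sum(A)  # no streams involved on the CPU
    current = torch.cuda.current_stream(A.device)
    s = torch.cuda.Stream(device=A.device)
    # The side stream must not start before the ops that produced A finish.
    s.wait_stream(current)
    with torch.cuda.stream(s):
        B = torch.sum(A)
    # A was allocated on `current` but is read on `s`; tell the caching
    # allocator so A's memory is not reused too early.
    A.record_stream(s)
    # Later work on the current stream must see the finished sum.
    current.wait_stream(s)
    return B


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
A = torch.empty((100, 100), device=device).normal_(0.0, 1.0)
B = sum_on_side_stream(A)
```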

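Stream semantics of backward passes¶

Each backward CUDA op runs on the same stream that was used for its corresponding forward op. If your forward pass runs independent ops in parallel on different streams, this helps the backward pass exploit that same parallelism.

The stream semantics of a backward call with respect to surrounding ops are the same as for any other call. The backward pass inserts internal syncs to ensure this even when backward ops run on multiple streams as described in the previous paragraph. More concretely, when calling autograd.backward, autograd.grad, or tensor.backward, and optionally supplying CUDA tensor(s) as the initial gradient(s) (e.g., autograd.backward(..., grad_tensors=initial_grads), autograd.grad(..., grad_outputs=initial_grads), or tensor.backward(..., gradient=initial_grad)), the acts of

optionally populating initial gradient(s),
invoking the backward pass, and
using the gradients

have the same stream-semantics relationship as any group of ops: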

s = torch.cuda.Stream()

# Safe, grads are used in the same stream context as backward()
with torch.cuda.stream(s):
    loss.backward()
    use grads

# Unsafe
with torch.cuda.stream(s):
    loss.backward()
use grads

# Safe, with synchronization
with torch.cuda.stream(s):
    loss.backward()
torch.cuda.current_stream().wait_stream(s)
use grads

# Safe, populating initial grad and invoking backward are in the same stream context
with torch.cuda.stream(s):
    loss.backward(gradient=torch.ones_like(loss))

# Unsafe, populating initial_grad and invoking backward are in different stream contexts,
# without synchronization
initial_grad = torch.ones_like(loss)
with torch.cuda.stream(s):
    loss.backward(gradient=initial_grad)

# Safe, with synchronization
initial_grad = torch.ones_like(loss)
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    initial_grad.record_stream(s)
    loss.backward(gradient=initial_grad)

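BC note: Using grads on the default stream¶

In prior versions of PyTorch (1.9 and earlier), the autograd engine always synced the default stream with all backward ops, so the following pattern: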

with torch.cuda.stream(s):
    loss.backward()
use grads

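was safe as long as use grads happened on the default stream. In present PyTorch, that pattern is no longer safe. If backward() and use grads are in different stream contexts, you must sync the streams: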

with torch.cuda.stream(s):
    loss.backward()
torch.cuda.current_stream().wait_stream(s)
use grads

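even if use grads is on the default stream.

Memory management¶

PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, the unused memory managed by the allocator still shows as used in nvidia-smi. You can use memory_allocated() and max_memory_allocated() to monitor the memory occupied by tensors, and memory_reserved() and max_memory_reserved() to monitor the total amount of memory managed by the caching allocator. Calling empty_cache() releases all unused cached memory from PyTorch so that it can be used by other GPU applications. However, GPU memory occupied by tensors is not freed, so calling it cannot increase the amount of GPU memory available to PyTorch.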
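
The difference between memory occupied by tensors and memory held by the caching allocator can be observed with the functions named above. The sketch below is illustrative; cuda_memory_report is a hypothetical helper name, and it returns an empty dict on machines without CUDA.

```python
import torch


def cuda_memory_report(device=None):
    """Return allocator statistics in MiB (empty dict without CUDA)."""
    if not torch.cuda.is_available():
        return {}
    mib = 1024 ** 2
    return {
        "allocated": torch.cuda.memory_allocated(device) / mib,
        "max_allocated": torch.cuda.max_memory_allocated(device) / mib,
        "reserved": torch.cuda.memory_reserved(device) / mib,
        "max_reserved": torch.cuda.max_memory_reserved(device) / mib,
    }


if torch.cuda.is_available():
    x = torch.empty(1024, 1024, device="cuda")  # about 4 MiB of float32
    before = cuda_memory_report()
    del x                     # the block goes back to the cache, not the driver
    torch.cuda.empty_cache()  # release unused cached blocks to the driver
    after = cuda_memory_report()
else:
    before = after = cuda_memory_report()
```

Comparing before and after shows which part of the reserved memory was only cached: "allocated" drops when the tensor is deleted, while "reserved" can only drop once empty_cache() is called.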

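For more advanced users, we offer more comprehensive memory benchmarking via memory_stats(). We also offer the capability to capture a complete snapshot of the memory allocator state via memory_snapshot(), which can help you understand the underlying allocation patterns produced by your code.

Use of a caching allocator can interfere with memory checking tools such as cuda-memcheck. To debug memory errors using cuda-memcheck, set PYTORCH_NO_CUDA_MEMORY_CACHING=1 in your environment to disable caching.

The behavior of the caching allocator can be controlled via the environment variable PYTORCH_CUDA_ALLOC_CONF. The format is PYTORCH_CUDA_ALLOC_CONF=<option>:<value>,<option2>:<value2>... Available options:

max_split_size_mb prevents the allocator from splitting blocks larger than this size (in MB). This can help prevent fragmentation and may allow some borderline workloads to complete without running out of memory. The performance cost can range from "zero" to "substantial" depending on allocation patterns. The default value is unlimited, i.e. all blocks can be split. The memory_stats() and memory_summary() methods are useful for tuning. This option should be used as a last resort for a workload that is aborting due to "out of memory" and showing a large amount of inactive split blocks.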
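
The option string can also be assembled and exported from Python, as long as this happens before the process makes its first CUDA allocation (doing it before importing torch is the cautious choice). This is a minimal sketch: build_alloc_conf is a hypothetical helper, and the value 128 is an arbitrary example rather than a recommendation.

```python
import os


def build_alloc_conf(options):
    """Format {"option": value} as "<option>:<value>,<option2>:<value2>"."""
    return ",".join(f"{name}:{value}" for name, value in options.items())


conf = build_alloc_conf({"max_split_size_mb": 128})

# setdefault keeps any value the user already exported in the shell.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", conf)

import torch  # imported only after the variable is in place
```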

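cuFFT plan cache¶

For each CUDA device, an LRU cache of cuFFT plans is used to speed up repeatedly running FFT methods (e.g., torch.fft.fft()) on CUDA tensors of the same geometry with the same configuration. Because some cuFFT plans may allocate GPU memory, these caches have a maximum capacity.

You may control and query the properties of the cache of the current device with the following APIs:

torch.backends.cuda.cufft_plan_cache.max_size gives the capacity of the cache (default is 4096 on CUDA 10 and newer, and 1023 on older CUDA versions). Setting this value directly modifies the capacity.
torch.backends.cuda.cufft_plan_cache.size gives the number of plans currently residing in the cache.
torch.backends.cuda.cufft_plan_cache.clear() clears the cache.

To control and query the plan caches of a non-default device, you can index the torch.backends.cuda.cufft_plan_cache object with either a torch.device object or a device index, and access one of the above attributes. For example, to set the capacity of the cache for device 1, one can write torch.backends.cuda.cufft_plan_cache[1].max_size = 10.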
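
The attributes above can be inspected around a repeated FFT call. The sketch below is illustrative only; fft_with_plan_cache_stats is a hypothetical helper, and it returns None when CUDA is unavailable because the plan cache exists only for CUDA devices.

```python
import torch


def fft_with_plan_cache_stats(n=1024):
    """Run the same FFT twice and report the current device's cache state."""
    if not torch.cuda.is_available():
        return None
    cache = torch.backends.cuda.cufft_plan_cache
    cache.clear()           # start from an empty cache
    x = torch.randn(n, device="cuda")
    torch.fft.fft(x)        # first call has to build a plan
    torch.fft.fft(x)        # same geometry, so the plan can be reused
    return {"size": cache.size, "max_size": cache.max_size}


stats = fft_with_plan_cache_stats()
```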

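Best practices¶

Device-agnostic code¶

Due to the structure of PyTorch, you may need to explicitly write device-agnostic (CPU or GPU) code; an example is creating a new tensor as the initial hidden state of a recurrent neural network.

The first step is to determine whether the GPU should be used or not. A common pattern is to use Python's argparse module to read in user arguments, and have a flag that can be used to disable CUDA, in combination with is_available(). In the following, args.device results in a torch.device object that can be used to move tensors to CPU or CUDA.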

import argparse
import torch

parser = argparse.ArgumentParser(description='PyTorch Example')
parser.add_argument('--disable-cuda', action='store_true',
                    help='Disable CUDA')
args = parser.parse_args()
args.device = None
if not args.disable_cuda and torch.cuda.is_available():
    args.device = torch.device('cuda')
else:
    args.device = torch.device('cpu')

Now that we have args.device, we can use it to create a tensor on the desired device.

x = torch.empty((8, 42), device=args.device)
net = Network().to(device=args.device)

This can be done in a number of cases to produce device-agnostic code. Below is an example when using a dataloader:

cuda0 = torch.device('cuda:0')  # CUDA GPU 0
for i, x in enumerate(train_loader):
    x = x.to(cuda0)

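When working with multiple GPUs on a system, you can use the CUDA_VISIBLE_DEVICES environment flag to manage which GPUs are available to PyTorch. As mentioned above, to manually control which GPU a tensor is created on, the best practice is to use a torch.cuda.device context manager.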

print("Outside device is 0")  # On device 0 (default in most scenarios)
with torch.cuda.device(1):
    print("Inside device is 1")  # On device 1
print("Outside device is still 0")  # On device 0

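If you have a tensor and would like to create a new tensor of the same type on the same device, you can use a torch.Tensor.new_* method (see torch.Tensor). Whereas the previously mentioned torch.* factory functions (Creation Ops) depend on the current GPU context and the attribute arguments you pass in, torch.Tensor.new_* methods preserve the device and other attributes of the tensor.

This is the recommended practice when creating modules in which new tensors need to be created internally during the forward pass.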

cuda = torch.device('cuda')
x_cpu = torch.empty(2)
x_gpu = torch.empty(2, device=cuda)
x_cpu_long = torch.empty(2, dtype=torch.int64)

y_cpu = x_cpu.new_full([3, 2], fill_value=0.3)
print(y_cpu)

    tensor([[ 0.3000,  0.3000],
            [ 0.3000,  0.3000],
            [ 0.3000,  0.3000]])

y_gpu = x_gpu.new_full([3, 2], fill_value=-5)
print(y_gpu)

    tensor([[-5.0000, -5.0000],
            [-5.0000, -5.0000],
            [-5.0000, -5.0000]], device='cuda:0')

y_cpu_long = x_cpu_long.new_tensor([[1, 2, 3]])
print(y_cpu_long)

    tensor([[ 1,  2,  3]])
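
Applied to the recurrent-network case mentioned at the start of this section, the same idea might look like the sketch below. TinyRNN is a hypothetical module written for illustration; the point is that the initial hidden state is created with new_zeros(), so it follows the device and dtype of the input without any explicit device argument.

```python
import torch
from torch import nn


class TinyRNN(nn.Module):
    """Hypothetical module that builds its initial hidden state in forward()."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.cell = nn.GRUCell(input_size, hidden_size)

    def forward(self, x):
        # x has shape (seq_len, batch, input_size).
        # new_zeros() inherits x's device and dtype.
        h = x.new_zeros(x.size(1), self.hidden_size)
        for step in x:
            h = self.cell(step, h)
        return h


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
net = TinyRNN(4, 8).to(device)
out = net(torch.randn(5, 3, 4, device=device))
```

The same module runs unchanged on CPU and GPU, because no device is hard-coded inside forward().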

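If you want to create a tensor of the same type and size as another tensor, and fill it with either ones or zeros, ones_like() or zeros_like() are provided as convenient helper functions (which also preserve the torch.device and torch.dtype of a tensor).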

x_cpu = torch.empty(2, 3)
x_gpu = torch.empty(2, 3, device='cuda')

y_cpu = torch.ones_like(x_cpu)
y_gpu = torch.zeros_like(x_gpu)

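Use pinned memory buffers¶

Warning

This is an advanced tip. If you overuse pinned memory, it can cause serious problems when running low on RAM, and you should be aware that pinning is often an expensive operation.

Host to GPU copies are much faster when they originate from pinned (page-locked) memory. CPU tensors and storages expose a pin_memory() method that returns a copy of the object with its data put in a pinned region.

Also, once you pin a tensor or storage, you can use asynchronous GPU copies. Just pass an additional non_blocking=True argument to a to() or a cuda() call. This can be used to overlap data transfers with computation.

You can make the DataLoader return batches placed in pinned memory by passing pin_memory=True to its constructor.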
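
Put together, a pinned DataLoader combined with non_blocking copies can be sketched as follows. The helper name copy_batches and the toy dataset are assumptions made for illustration; pinning is only requested when CUDA is available, so the sketch also runs on CPU-only machines.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def copy_batches(loader, device):
    """Move each batch to `device` and return the sum of all elements."""
    total = 0.0
    for (batch,) in loader:
        # With a pinned source tensor, this copy can be asynchronous.
        batch = batch.to(device, non_blocking=True)
        # .item() synchronizes, so the value is safe to read here.
        total += batch.sum().item()
    return total


use_cuda = torch.cuda.is_available()
device = torch.device('cuda' if use_cuda else 'cpu')

dataset = TensorDataset(torch.ones(64, 8))
loader = DataLoader(dataset, batch_size=16, pin_memory=use_cuda)
total = copy_batches(loader, device)

if use_cuda:
    # A single tensor can be pinned explicitly as well.
    pinned = torch.ones(4).pin_memory()
    on_gpu = pinned.to(device, non_blocking=True)
```

In a real training loop the benefit comes from issuing the copy for the next batch while the GPU is still busy with the current one; the .item() call in this sketch forces a synchronization on every batch and is there only to make the result easy to check.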

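Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel¶

Most use cases involving batched inputs and multiple GPUs should default to using DistributedDataParallel to utilize more than one GPU.

There are significant caveats to using CUDA models with multiprocessing; unless care is taken to meet the data handling requirements exactly, it is likely that your program will have incorrect or undefined behavior.

It is recommended to use DistributedDataParallel instead of DataParallel for multi-GPU training, even if there is only a single node.

The difference between DistributedDataParallel and DataParallel is: DistributedDataParallel uses multiprocessing, where a process is created for each GPU, while DataParallel uses multithreading. By using multiprocessing, each GPU has its dedicated process, which avoids the performance overhead caused by the GIL of the Python interpreter.

If you use DistributedDataParallel, you can use the torch.distributed.launch utility to launch your program; see Third-party backends.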

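CUDA Graphs¶

A CUDA graph is a record of the work (mostly kernels and their arguments) that a CUDA stream and its dependent streams perform. For general principles and details on the underlying CUDA API, see Getting Started with CUDA Graphs and the Graphs section of the CUDA C Programming Guide.

PyTorch supports the construction of CUDA graphs using stream capture, which puts a CUDA stream in capture mode. CUDA work issued to a capturing stream doesn't actually run on the GPU. Instead, the work is recorded in a graph.

After capture, the graph can be launched to run the GPU work as many times as needed. Each replay runs the same kernels with the same arguments. For pointer arguments this means the same memory addresses are used. By filling input memory with new data (e.g., from a new batch) before each replay, you can rerun the same work on new data.

Why CUDA Graphs?¶

Replaying a graph sacrifices the dynamic flexibility of typical eager execution in exchange for greatly reduced CPU overhead. A graph's arguments and kernels are fixed, so a graph replay skips all layers of argument setup and kernel dispatch, including Python, C++, and CUDA driver overheads. Under the hood, a replay submits the entire graph's work to the GPU with a single call to cudaGraphLaunch. Kernels in a replay also execute slightly faster on the GPU, but eliding CPU overhead is the main benefit.

You should try CUDA graphs if all or part of your network is graph-safe (usually this means static shapes and static control flow, but see the other constraints) and you suspect its runtime is at least somewhat CPU-limited.

PyTorch API¶

Warning

This API is in beta and may change in future releases.

PyTorch exposes graphs via a raw torch.cuda.CUDAGraph class and two convenience wrappers, torch.cuda.graph and torch.cuda.make_graphed_callables.

torch.cuda.graph is a simple, versatile context manager that captures CUDA work in its context. Before capture, warm up the workload to be captured by running a few eager iterations. Warmup must occur on a side stream. Because the graph reads from and writes to the same memory addresses in every replay, you must maintain long-lived references to the tensors that hold input and output data during capture. To run the graph on new input data, copy the new data to the capture's input tensor(s), replay the graph, then read the new output from the capture's output tensor(s). Example: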

g = torch.cuda.CUDAGraph()

# Placeholder input used for capture
static_input = torch.empty((5,), device="cuda")

# Warmup before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = static_input * 2
torch.cuda.current_stream().wait_stream(s)

# Captures the graph
# To allow capture, automatically sets a side stream as the current stream in the context
with torch.cuda.graph(g):
    static_output = static_input * 2

# Fills the graph's input memory with new data to compute on
static_input.copy_(torch.full((5,), 3, device="cuda"))
g.replay()
# static_output holds the results
print(static_output)  # full of 3 * 2 = 6

# Fills the graph's input memory with more data to compute on
static_input.copy_(torch.full((5,), 4, device="cuda"))
g.replay()
print(static_output)  # full of 4 * 2 = 8

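See Whole-network capture, Usage with torch.cuda.amp, and Usage with multiple streams for realistic and advanced patterns.

make_graphed_callables is more sophisticated. make_graphed_callables accepts Python functions and torch.nn.Modules. For each passed function or Module, it creates separate graphs of the forward-pass and backward-pass work. See Partial-network capture.

Constraints¶

A set of ops is capturable if it doesn't violate any of the following constraints.

Constraints apply to all work in a torch.cuda.graph context and all work in the forward and backward passes of any callable you pass to torch.cuda.make_graphed_callables().

Violating any of these will likely cause a runtime error:

Capture must occur on a non-default stream. (This is only a concern if you use the raw CUDAGraph.capture_begin and CUDAGraph.capture_end calls. graph and make_graphed_callables() set a side stream for you.)
Ops that synchronize the CPU with the GPU (e.g., .item() calls) are prohibited.
CUDA RNG ops are allowed, but must use default generators. For example, explicitly constructing a new torch.Generator instance and passing it as the generator argument to an RNG function is prohibited.

Violating any of these will likely cause silent numerical errors or undefined behavior:

Within a process, only one capture may be underway at a time.
No non-captured CUDA work may run in this process (on any thread) while capture is underway.
CPU work is not captured. If the captured ops include CPU work, that work is elided during replay.
Every replay reads from and writes to the same (virtual) memory addresses.
Dynamic control flow (based on CPU or GPU data) is prohibited.
Dynamic shapes are prohibited. The graph assumes every tensor in the captured op sequence has the same size and layout in every replay.
Using multiple streams in a capture is allowed, but there are restrictions.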

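Non-constraints¶

Once captured, the graph may be replayed on any stream.

Whole-network capture¶

If your entire network is capturable, you can capture and replay an entire iteration: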

N, D_in, H, D_out = 640, 4096, 2048, 1024
model = torch.nn.Sequential(torch.nn.Linear(D_in, H),
                            torch.nn.Dropout(p=0.2),
                            torch.nn.Linear(H, D_out),
                            torch.nn.Dropout(p=0.1)).cuda()
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Placeholders used for capture
static_input = torch.randn(N, D_in, device='cuda')
static_target = torch.randn(N, D_out, device='cuda')

# warmup
# Uses static_input and static_target here for convenience,
# but in a real setting, because the warmup includes optimizer.step()
# you must use a few batches of real data.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        optimizer.zero_grad(set_to_none=True)
        y_pred = model(static_input)
        loss = loss_fn(y_pred, static_target)
        loss.backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# capture
g = torch.cuda.CUDAGraph()
# Sets grads to None before capture, so backward() will create
# .grad attributes with allocations from the graph's private pool
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_y_pred = model(static_input)
    static_loss = loss_fn(static_y_pred, static_target)
    static_loss.backward()
    optimizer.step()

real_inputs = [torch.rand_like(static_input) for _ in range(10)]
real_targets = [torch.rand_like(static_target) for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    # Fills the graph's input memory with new data to compute on
    static_input.copy_(data)
    static_target.copy_(target)
    # replay() includes forward, backward, and step.
    # You don't even need to call optimizer.zero_grad() between iterations
    # because the captured backward refills static .grad tensors in place.
    g.replay()
    # Params have been updated. static_y_pred, static_loss, and .grad
    # attributes hold values from computing on this iteration's data.

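Partial-network capture¶

If some of your network is unsafe to capture (e.g., due to dynamic control flow, dynamic shapes, CPU syncs, or essential CPU-side logic), you can run the unsafe part(s) eagerly and use torch.cuda.make_graphed_callables() to graph only the capture-safe part(s).

By default, callables returned by make_graphed_callables() are autograd-aware, and can be used in the training loop as direct replacements for the functions or nn.Modules you passed.

make_graphed_callables() internally creates CUDAGraph objects, runs warmup iterations, and maintains static inputs and outputs as needed. Therefore (unlike with torch.cuda.graph) you don't need to handle those manually.

In the following example, data-dependent dynamic control flow means the network isn't capturable end-to-end, but make_graphed_callables() lets us capture and run graph-safe sections as graphs regardless: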

from itertools import chain

N, D_in, H, D_out = 640, 4096, 2048, 1024

module1 = torch.nn.Linear(D_in, H).cuda()
module2 = torch.nn.Linear(H, D_out).cuda()
module3 = torch.nn.Linear(H, D_out).cuda()

loss_fn = torch.nn.MSELoss()
# parameters() returns generators, which chain() concatenates lazily
optimizer = torch.optim.SGD(chain(module1.parameters(),
                                  module2.parameters(),
                                  module3.parameters()),
                            lr=0.1)

# Sample inputs used for capture
# requires_grad state of sample inputs must match
# requires_grad state of real inputs each callable will see.
x = torch.randn(N, D_in, device='cuda')
h = torch.randn(N, H, device='cuda', requires_grad=True)

module1 = torch.cuda.make_graphed_callables(module1, (x,))
module2 = torch.cuda.make_graphed_callables(module2, (h,))
module3 = torch.cuda.make_graphed_callables(module3, (h,))

real_inputs = [torch.rand_like(x) for _ in range(10)]
real_targets = [torch.randn(N, D_out, device="cuda") for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    optimizer.zero_grad(set_to_none=True)

    tmp = module1(data)  # forward ops run as a graph

    if tmp.sum().item() > 0:
        tmp = module2(tmp)  # forward ops run as a graph
    else:
        tmp = module3(tmp)  # forward ops run as a graph

    loss = loss_fn(tmp, target)
    # module2's or module3's (whichever was chosen) backward ops,
    # as well as module1's backward ops, run as graphs
    loss.backward()
    optimizer.step()

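Usage with torch.cuda.amp¶

For typical optimizers, GradScaler.step syncs the CPU with the GPU, which is prohibited during capture. To avoid errors, either use partial-network capture, or (if forward, loss, and backward are capture-safe) capture forward, loss, and backward but not the optimizer step: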

# warmup
# In a real setting, use a few batches of real data.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            y_pred = model(static_input)
            loss = loss_fn(y_pred, static_target)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
torch.cuda.current_stream().wait_stream(s)

# capture
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    with torch.cuda.amp.autocast():
        static_y_pred = model(static_input)
        static_loss = loss_fn(static_y_pred, static_target)
    scaler.scale(static_loss).backward()
    # don't capture scaler.step(optimizer) or scaler.update()

real_inputs = [torch.rand_like(static_input) for _ in range(10)]
real_targets = [torch.rand_like(static_target) for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    static_input.copy_(data)
    static_target.copy_(target)
    g.replay()
    # Runs scaler.step and scaler.update eagerly
    scaler.step(optimizer)
    scaler.update()

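Usage with multiple streams¶

Capture mode automatically propagates to any streams that sync with a capturing stream. Within capture, you may expose parallelism by issuing calls to different streams, but the overall stream dependency DAG must branch out from the initial capturing stream after capture begins and rejoin the initial stream before capture ends: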

with torch.cuda.graph(g):
    # at context manager entrance, torch.cuda.current_stream()
    # is the initial capturing stream

    # INCORRECT (does not branch out from or rejoin initial stream)
    with torch.cuda.stream(s):
        cuda_work()

    # CORRECT:
    # branches out from initial stream
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        cuda_work()
    # rejoins initial stream before capture ends
    torch.cuda.current_stream().wait_stream(s)

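Note

To avoid confusion for power users looking at replays in nsight systems or nvprof: unlike eager execution, the graph interprets a nontrivial stream DAG in capture as a hint, not a command. During replay, the graph may reorganize independent ops onto different streams or enqueue them in a different order (while respecting your original DAG's overall dependencies).

Usage with DistributedDataParallel¶

NCCL < 2.9.6¶

NCCL versions earlier than 2.9.6 don't allow collectives to be captured. You must use partial-network capture, which defers allreduces to happen outside the graphed sections of backward.

Call make_graphed_callables() on graphable network sections before wrapping the network with DDP.

NCCL ≥ 2.9.6¶

NCCL versions 2.9.6 or later allow collectives in the graph. Approaches that capture an entire backward pass are a viable option, but need three setup steps.

Disable DDP's internal async error handling: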

os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "0"
torch.distributed.init_process_group(...)

Before full-backward capture, DDP must be constructed in a side-stream context:
```
with torch.cuda.stream(s):
    model = DistributedDataParallel(model)
```
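Your warmup must run at least 11 DDP-enabled eager iterations before capture.

Graph memory management¶

A captured graph acts on the same virtual addresses every time it replays. If PyTorch frees the memory, a later replay can hit an illegal memory access. If PyTorch reassigns the memory to new tensors, the replay can corrupt the values seen by those tensors. Therefore, the virtual addresses used by the graph must be reserved for the graph across replays. The PyTorch caching allocator achieves this by detecting when capture is underway and satisfying the capture's allocations from a graph-private memory pool. The private pool stays alive until its CUDAGraph object and all tensors created during capture go out of scope.

Private pools are maintained automatically. By default, the allocator creates a separate private pool for each capture. If you capture multiple graphs, this conservative approach ensures graph replays never corrupt each other's values, but sometimes needlessly wastes memory.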
