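CUDA semantics¶

torch.cuda is used to set up and run CUDA operations. It keeps track of the currently selected GPU, and all CUDA tensors you allocate will by default be created on that device. The selected device can be changed with a torch.cuda.device context manager.

However, once a tensor is allocated, you can do operations on it irrespective of the selected device, and the results will always be placed on the same device as the tensor.

Cross-GPU operations are not allowed by default, with the exception of copy_() and other methods with copy-like functionality such as to() and cuda(). Unless you enable peer-to-peer memory access, any attempt to launch ops on tensors spread across different devices will raise an error.

Below you can find a small example showcasing this: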

cuda = torch.device('cuda')     # Default CUDA device
cuda0 = torch.device('cuda:0')
cuda2 = torch.device('cuda:2')  # GPU 2 (these are 0-indexed)

x = torch.tensor([1., 2.], device=cuda0)
# x.device is device(type='cuda', index=0)
y = torch.tensor([1., 2.]).cuda()
# y.device is device(type='cuda', index=0)

with torch.cuda.device(1):
    # allocates a tensor on GPU 1
    a = torch.tensor([1., 2.], device=cuda)

    # transfers a tensor from CPU to GPU 1
    b = torch.tensor([1., 2.]).cuda()
    # a.device and b.device are device(type='cuda', index=1)

    # You can also use ``Tensor.to`` to transfer a tensor:
    b2 = torch.tensor([1., 2.]).to(device=cuda)
    # b.device and b2.device are device(type='cuda', index=1)

    c = a + b
    # c.device is device(type='cuda', index=1)

    z = x + y
    # z.device is device(type='cuda', index=0)

    # even within a context, you can specify the device
    # (or give a GPU index to the .cuda call)
    d = torch.randn(2, device=cuda2)
    e = torch.randn(2).to(cuda2)
    f = torch.randn(2).cuda(cuda2)
    # d.device, e.device, and f.device are all device(type='cuda', index=2)

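TensorFloat-32 (TF32) on Ampere (and later) devices¶

Starting in PyTorch 1.7, there is a new flag called allow_tf32. This flag defaults to True in PyTorch 1.7 to PyTorch 1.11, and to False in PyTorch 1.12 and later. This flag controls whether PyTorch is allowed to use the TensorFloat32 (TF32) tensor cores, available on NVIDIA GPUs since Ampere, internally to compute matmul (matrix multiplies and batched matrix multiplies) and convolutions.

TF32 tensor cores are designed to achieve better performance on matmul and convolutions on torch.float32 tensors by rounding input data to have 10 bits of mantissa and accumulating results with FP32 precision, maintaining FP32 dynamic range.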
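To make the "10 bits of mantissa" statement concrete, the sketch below emulates the input rounding in pure Python. It is only an illustration: the helper name round_to_tf32 is hypothetical, round-to-nearest-even is an assumption about the rounding mode, and the code handles ordinary finite values only (no NaN, infinity, or values near the float32 maximum).

```python
import struct


def round_to_tf32(x: float) -> float:
    """Round a finite float32 value to 10 explicit mantissa bits (illustrative)."""
    # Reinterpret the value as its 32-bit IEEE 754 bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # float32 stores 23 mantissa bits; keeping 10 means dropping the low 13.
    # Adding 0x0FFF plus the lowest kept bit gives round-to-nearest-even.
    bits += 0x0FFF + ((bits >> 13) & 1)
    bits &= 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


x = 1.0 / 3.0
rel_err = abs(round_to_tf32(x) - x) / x
print(round_to_tf32(x), rel_err)
```

The relative error of a single rounded input stays below 2**-10, which is the order of magnitude of the error increase reported for TF32 matmuls in this section.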

Matmuls and convolutions are controlled separately, and their corresponding flags can be accessed at:

# The flag below controls whether to allow TF32 on matmul. This flag defaults to False
# in PyTorch 1.12 and later.
torch.backends.cuda.matmul.allow_tf32 = True

# The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
torch.backends.cudnn.allow_tf32 = True

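The precision of matmuls can also be set via torch.set_float32_matmul_precision(). Note that besides matmuls and convolutions themselves, functions and nn modules that internally use matmuls or convolutions are also affected. These include nn.Linear, nn.Conv*, cdist, tensordot, affine grid and grid sample, adaptive log softmax, GRU and LSTM.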
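A minimal usage sketch of the precision setter follows. It assumes a PyTorch build that provides torch.set_float32_matmul_precision and torch.get_float32_matmul_precision; the accepted values are "highest", "high", and "medium", where "highest" keeps full FP32 internal precision. The previous setting is restored at the end so the example leaves global state unchanged.

```python
import torch

# Remember the current global setting so it can be restored afterwards.
previous = torch.get_float32_matmul_precision()

# "high" permits reduced internal precision (such as TF32) for float32 matmuls.
torch.set_float32_matmul_precision("high")
current = torch.get_float32_matmul_precision()
print(current)

# Restore the original setting.
torch.set_float32_matmul_precision(previous)
```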

To get an idea of the precision and speed, see the example code and benchmark data (on A100) below:

a_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
b_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
ab_full = a_full @ b_full
mean = ab_full.abs().mean()  # 80.7277

a = a_full.float()
b = b_full.float()

# Do matmul at TF32 mode.
torch.backends.cuda.matmul.allow_tf32 = True
ab_tf32 = a @ b  # takes 0.016s on GA100
error = (ab_tf32 - ab_full).abs().max()  # 0.1747
relative_error = error / mean  # 0.0022

# Do matmul with TF32 disabled.
torch.backends.cuda.matmul.allow_tf32 = False
ab_fp32 = a @ b  # takes 0.11s on GA100
error = (ab_fp32 - ab_full).abs().max()  # 0.0031
relative_error = error / mean  # 0.000039

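From the above example, we can see that with TF32 enabled, the speed is ~7x faster on A100, and that the relative error compared to double precision is approximately 2 orders of magnitude larger. Note that the exact ratio of TF32 to single precision speed depends on the hardware generation, as properties such as the ratio of memory bandwidth to compute, as well as the ratio of TF32 to FP32 matmul throughput, may vary from generation to generation or model to model. If full FP32 precision is needed, users can disable TF32 by: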

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

To toggle the TF32 flags off in C++, you can do

at::globalContext().setAllowTF32CuBLAS(false);
at::globalContext().setAllowTF32CuDNN(false);

For more information about TF32, see NVIDIA's documentation on TensorFloat-32 and the Ampere architecture.

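Reduced Precision Reduction in FP16 GEMMs¶

fp16 GEMMs are potentially done with some intermediate reduced precision reductions (e.g., in fp16 rather than fp32). These selective reductions in precision can allow for higher performance on certain workloads (particularly those with a large k dimension) and GPU architectures, at the cost of numerical precision and potential for overflow.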
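The precision cost of reducing in fp16 can be illustrated without a GPU. The sketch below is not how cuBLAS performs its reductions; it only shows what happens when a long sum is rounded to half precision after every step. The helper to_fp16 is a hypothetical name, and a Python float stands in for the fp32 accumulator (exact for these small integers).

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


k = 4096  # length of the reduction, playing the role of the GEMM's k dimension
acc_fp16 = 0.0
acc_fp32 = 0.0
for _ in range(k):
    # Partial sum rounded to fp16 after every addition.
    acc_fp16 = to_fp16(acc_fp16 + 1.0)
    # Higher-precision accumulator; integers this small are represented exactly.
    acc_fp32 += 1.0

print(acc_fp16, acc_fp32)  # 2048.0 4096.0
```

Once the fp16 partial sum reaches 2048, adding 1.0 no longer changes it, because adjacent half-precision values in that range are 2 apart. This is why a large k dimension is the case where reduced precision reductions matter most.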

Some example benchmark data on V100:

[--------------------------- bench_gemm_transformer --------------------------]
      [  m ,  k  ,  n  ]    |  allow_fp16_reduc=True  |  allow_fp16_reduc=False
1 threads: --------------------------------------------------------------------
      [4096, 4048, 4096]    |           1634.6        |           1639.8
      [4096, 4056, 4096]    |           1670.8        |           1661.9
      [4096, 4080, 4096]    |           1664.2        |           1658.3
      [4096, 4096, 4096]    |           1639.4        |           1651.0
      [4096, 4104, 4096]    |           1677.4        |           1674.9
      [4096, 4128, 4096]    |           1655.7        |           1646.0
      [4096, 4144, 4096]    |           1796.8        |           2519.6
      [4096, 5096, 4096]    |           2094.6        |           3190.0
      [4096, 5104, 4096]    |           2144.0        |           2663.5
      [4096, 5112, 4096]    |           2149.1        |           2766.9
      [4096, 5120, 4096]    |           2142.8        |           2631.0
      [4096, 9728, 4096]    |           3875.1        |           5779.8
      [4096, 16384, 4096]   |           6182.9        |           9656.5
(times in microseconds).

If full precision reductions are needed, users can disable reduced precision reductions in fp16 GEMMs with:

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False

To toggle the reduced precision reduction flag in C++, one can do

at::globalContext().setAllowFP16ReductionCuBLAS(false);

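Reduced Precision Reduction in BF16 GEMMs¶

A similar flag (as above) exists for BFloat16 GEMMs. Note that this switch is set to True by default for BF16; if you observe numerical instability in your workload, you may wish to set it to False.

If reduced precision reductions are not desired, users can disable them in bf16 GEMMs with: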

torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False

To toggle the reduced precision reduction flag in C++, one can do

at::globalContext().setAllowBF16ReductionCuBLAS(true);

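Asynchronous execution¶

By default, GPU operations are asynchronous. When you call a function that uses the GPU, the operations are enqueued to the particular device, but not necessarily executed until later. This allows us to execute more computations in parallel, including operations on the CPU or other GPUs.

In general, the effect of asynchronous computation is invisible to the caller, because (1) each device executes operations in the order they are queued, and (2) PyTorch automatically performs necessary synchronization when copying data between CPU and GPU or between two GPUs. Hence, computation will proceed as if every operation was executed synchronously.

You can force synchronous computation by setting the environment variable CUDA_LAUNCH_BLOCKING=1. This can be handy when an error occurs on the GPU. (With asynchronous execution, such an error isn't reported until after the operation is actually executed, so the stack trace does not show where it was requested.)

A consequence of the asynchronous computation is that time measurements without synchronization are not accurate. To get precise measurements, either call torch.cuda.synchronize() before measuring, or use torch.cuda.Event to record times as follows: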

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()

# Run some things here

end_event.record()
torch.cuda.synchronize()  # Wait for the events to be recorded!
elapsed_time_ms = start_event.elapsed_time(end_event)

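As an exception, several functions such as to() and copy_() admit an explicit non_blocking argument, which lets the caller bypass synchronization when it is unnecessary. Another exception is CUDA streams, explained below.

CUDA streams¶

A CUDA stream is a linear sequence of execution that belongs to a specific device. You normally do not need to create one explicitly: by default, each device uses its own "default" stream.

Operations inside each stream are serialized in the order they are created, but operations from different streams can execute concurrently in any relative order, unless explicit synchronization functions (such as synchronize() or wait_stream()) are used. For example, the following code is incorrect: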

cuda = torch.device('cuda')
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
with torch.cuda.stream(s):
    # sum() may start execution before normal_() finishes!
    B = torch.sum(A)

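When the "current stream" is the default stream, PyTorch automatically performs necessary synchronization when data is moved around, as explained above. However, when using non-default streams, it is the user's responsibility to ensure proper synchronization. The fixed version of this example is: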

cuda = torch.device('cuda')
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
s.wait_stream(torch.cuda.default_stream(cuda))  # NEW!
with torch.cuda.stream(s):
    B = torch.sum(A)
A.record_stream(s)  # NEW!

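There are two new additions. The torch.cuda.Stream.wait_stream() call ensures that the normal_() execution has finished before we start running sum(A) on a side stream. The torch.Tensor.record_stream() call ensures that we do not deallocate A before sum(A) has completed. You can also manually wait on the stream at some later point in time with torch.cuda.default_stream(cuda).wait_stream(s) (note that it is pointless to wait immediately, since that would prevent the stream execution from running in parallel with other work on the default stream). See the documentation for torch.Tensor.record_stream() for more details on when to use one or the other.

Note that this synchronization is necessary even when there is no read dependency, e.g., as seen in this example: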

cuda = torch.device('cuda')
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda)
s.wait_stream(torch.cuda.default_stream(cuda))  # STILL REQUIRED!
with torch.cuda.stream(s):
    A.normal_(0.0, 1.0)
    A.record_stream(s)

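Despite the computation on s not reading the contents of A and there being no other uses of A, it is still necessary to synchronize, because A may correspond to memory reallocated by the CUDA caching allocator, with pending operations from the old (deallocated) memory.

Stream semantics of backward passes¶

Each backward CUDA op runs on the same stream that was used for its corresponding forward op. If your forward pass runs independent ops in parallel on different streams, this helps the backward pass exploit that same parallelism.

The stream semantics of a backward call with respect to surrounding ops are the same as for any other call. The backward pass inserts internal syncs to ensure this even when backward ops run on multiple streams as described in the previous paragraph. More concretely, when calling autograd.backward, autograd.grad, or tensor.backward, and optionally supplying CUDA tensor(s) as the initial gradient(s) (e.g., autograd.backward(..., grad_tensors=initial_grads), autograd.grad(..., grad_outputs=initial_grads), or tensor.backward(..., gradient=initial_grad)), the acts of

optionally populating initial gradient(s),
invoking the backward pass, and
using the gradients

have the same stream-semantics relationship as any group of ops: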

s = torch.cuda.Stream()

# Safe, grads are used in the same stream context as backward()
with torch.cuda.stream(s):
    loss.backward()
    use grads

# Unsafe
with torch.cuda.stream(s):
    loss.backward()
use grads

# Safe, with synchronization
with torch.cuda.stream(s):
    loss.backward()
torch.cuda.current_stream().wait_stream(s)
use grads

# Safe, populating initial grad and invoking backward are in the same stream context
with torch.cuda.stream(s):
    loss.backward(gradient=torch.ones_like(loss))

# Unsafe, populating initial_grad and invoking backward are in different stream contexts,
# without synchronization
initial_grad = torch.ones_like(loss)
with torch.cuda.stream(s):
    loss.backward(gradient=initial_grad)

# Safe, with synchronization
initial_grad = torch.ones_like(loss)
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    initial_grad.record_stream(s)
    loss.backward(gradient=initial_grad)

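BC note: Using grads on the default stream¶

In prior versions of PyTorch (1.9 and earlier), the autograd engine always synced the default stream with all backward ops, so the following pattern: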

with torch.cuda.stream(s):
    loss.backward()
use grads

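was safe as long as use grads happened on the default stream. In present PyTorch, that pattern is no longer safe. If backward() and use grads are in different stream contexts, you must sync the streams: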

with torch.cuda.stream(s):
    loss.backward()
torch.cuda.current_stream().wait_stream(s)
use grads

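even if use grads is on the default stream.

Memory management¶

PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, the unused memory managed by the allocator will still show as if used in nvidia-smi. You can use memory_allocated() and max_memory_allocated() to monitor memory occupied by tensors, and use memory_reserved() and max_memory_reserved() to monitor the total amount of memory managed by the caching allocator. Calling empty_cache() releases all unused cached memory from PyTorch so that it can be used by other GPU applications. However, the GPU memory occupied by tensors will not be freed, so it cannot increase the amount of GPU memory available for PyTorch.

To better understand how CUDA memory is being used over time, Understanding CUDA Memory Usage describes tools for capturing and visualizing traces of memory use.

For more advanced users, we offer more comprehensive memory benchmarking via memory_stats(). We also offer the capability to capture a complete snapshot of the memory allocator state via memory_snapshot(), which can help you understand the underlying allocation patterns produced by your code.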
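The difference between allocated and reserved memory can be observed with a short script. This is a sketch under the assumption that a CUDA device may or may not be present: the helper report_cuda_memory is a hypothetical name, the tensor size is arbitrary, and an empty dictionary is returned on machines without CUDA.

```python
import torch


def report_cuda_memory() -> dict:
    """Return allocator counters (in bytes) around a small allocation."""
    if not torch.cuda.is_available():
        return {}
    x = torch.empty(1024, 1024, device="cuda")  # roughly 4 MiB of float32
    stats = {
        "allocated": torch.cuda.memory_allocated(),  # memory occupied by tensors
        "reserved": torch.cuda.memory_reserved(),    # memory held by the caching allocator
    }
    del x
    # Return unused cached blocks to the driver so other applications can use them.
    torch.cuda.empty_cache()
    stats["reserved_after_empty_cache"] = torch.cuda.memory_reserved()
    return stats


report = report_cuda_memory()
print(report)
```

On a GPU machine, the reserved figure is at least as large as the allocated one, and it is the reserved figure that tools such as nvidia-smi reflect.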

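Optimizing memory usage with PYTORCH_CUDA_ALLOC_CONF¶

Use of a caching allocator can interfere with memory checking tools such as cuda-memcheck. To debug memory errors using cuda-memcheck, set PYTORCH_NO_CUDA_MEMORY_CACHING=1 in your environment to disable caching.

The behavior of the caching allocator can be controlled via the environment variable PYTORCH_CUDA_ALLOC_CONF. The format is PYTORCH_CUDA_ALLOC_CONF=<option>:<value>,<option2>:<value2>... Available options: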

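backend allows selecting the underlying allocator implementation. Currently, valid options are native, which uses PyTorch's native implementation, and cudaMallocAsync, which uses CUDA's built-in asynchronous allocator. cudaMallocAsync requires CUDA 11.4 or newer. The default is native. backend applies to all devices used by the process, and can't be specified on a per-device basis.
max_split_size_mb prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. The performance cost can range from 'zero' to 'substantial' depending on allocation patterns. The default value is unlimited, i.e. all blocks can be split. The memory_stats() and memory_summary() methods are useful for tuning. This option should be used as a last resort for a workload that is aborting due to 'out of memory' and showing a large amount of inactive split blocks. max_split_size_mb is only meaningful with backend:native. With backend:cudaMallocAsync, max_split_size_mb is ignored.
roundup_power2_divisions helps with rounding the requested allocation size to the nearest power-2 division and making better use of the blocks. In the native CUDACachingAllocator, sizes are rounded up to a multiple of the block size of 512, so this works fine for smaller sizes. However, this can be inefficient for large near-by allocations, as each will go to a different size of blocks and reuse of those blocks is minimized. This might create lots of unused blocks and waste GPU memory capacity. This option enables rounding the allocation size to the nearest power-2 division. For example, if we need to round up a size of 1200 and the number of divisions is 4, the size 1200 lies between 1024 and 2048, and with 4 divisions between them the values are 1024, 1280, 1536, and 1792. So, an allocation size of 1200 will be rounded to 1280 as the nearest ceiling of power-2 division. Specify a single value to apply to all allocation sizes, or specify an array of key-value pairs to set the power-2 division individually for each power-of-two interval. For example, to set 1 division for all allocations under 256MB, 2 divisions for allocations between 256MB and 512MB, 4 divisions for allocations between 512MB and 1GB, and 8 divisions for any larger allocations, set the knob value to: [256:1,512:2,1024:4,>:8]. roundup_power2_divisions is only meaningful with backend:native. With backend:cudaMallocAsync, roundup_power2_divisions is ignored.
garbage_collection_threshold helps actively reclaim unused GPU memory to avoid triggering the expensive sync-and-reclaim-all operation (release_cached_blocks), which can be unfavorable to latency-critical GPU applications (e.g., servers). Upon setting this threshold (e.g., 0.8), the allocator will start reclaiming GPU memory blocks if the GPU memory capacity usage exceeds the threshold (i.e., 80% of the total memory allocated to the GPU application). The algorithm prefers to free old and unused blocks first, to avoid freeing blocks that are actively being reused. The threshold value should be greater than 0.0 and less than 1.0. garbage_collection_threshold is only meaningful with backend:native. With backend:cudaMallocAsync, garbage_collection_threshold is ignored.
expandable_segments (experimental, default: False) If set to True, this setting instructs the allocator to create CUDA allocations that can later be expanded, to better handle cases where a job changes allocation sizes frequently, such as having a changing batch size. Normally for large (>2MB) allocations, the allocator calls cudaMalloc to get allocations that are the same size as what the user requests. In the future, parts of these allocations can be reused for other requests if they are free. This works well when the program makes many requests of exactly the same size or of sizes that are even multiples of that size. Many deep learning models follow this behavior. However, one common exception is when the batch size changes slightly from one iteration to the next, e.g. in batched inference. When the program runs initially with batch size N, it will make allocations appropriate for that size. If in the future it runs at size N - 1, the existing allocations will still be big enough. However, if it runs at size N + 1, then it will have to make new allocations that are slightly larger. Not all the tensors are the same size. Some might be (N + 1)*A and others (N + 1)*A*B, where A and B are some non-batch dimensions in the model. Because the allocator reuses existing allocations when they are big enough, some number of (N + 1)*A allocations will actually fit in the already existing N*B*A segments, though not perfectly. As the model runs it will partially fill up all of these segments, leaving unusable free slices of memory at the end of these segments. The allocator at some point will need to cudaMalloc a new (N + 1)*A*B segment. If there is not enough memory, there is now no way to recover the slices of memory that are free at the end of existing segments. With models 50+ layers deep, this pattern might repeat 50+ times, creating many slivers.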

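expandable_segments allows the allocator to create a segment initially and then expand its size later when more memory is needed. Instead of making one segment per allocation, it tries to make one segment (per stream) that grows as necessary. Now when the N + 1 case runs, the allocations will tile nicely into the one large segment until it fills up. Then more memory is requested and appended to the end of the segment. This process does not create as many slivers of unusable memory, so it is more likely to succeed at finding this memory.

pinned_use_cuda_host_register is a boolean flag that determines whether to use the CUDA API's cudaHostRegister function for allocating pinned memory instead of the default cudaHostAlloc. When set to True, the memory is allocated using regular malloc and then pages are mapped to the memory before calling cudaHostRegister. This pre-mapping of pages helps reduce the lock time during the execution of cudaHostRegister.

pinned_num_register_threads is only valid when pinned_use_cuda_host_register is set to True. By default, one thread is used to map the pages. This option allows using more threads to parallelize the page mapping operations, reducing the overall allocation time of pinned memory. A good value for this option is 8, based on benchmarking results.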
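The rounding arithmetic described for roundup_power2_divisions can be reproduced in a few lines. The function below is a hypothetical illustration of the 1200-to-1280 example from the option description, not the allocator's actual code, and it ignores the allocator's separate rounding to multiples of 512.

```python
def roundup_power2_division(size: int, divisions: int) -> int:
    """Round size up to the next boundary obtained by splitting each
    power-of-two interval into `divisions` equal steps (illustrative)."""
    if size <= 0 or divisions < 1:
        raise ValueError("size must be positive and divisions at least 1")
    # Largest power of two that does not exceed size.
    lower = 1 << (size.bit_length() - 1)
    if lower == size:
        return size
    # Width of one division inside the interval [lower, 2 * lower).
    step = max(lower // divisions, 1)
    steps_needed = -(-(size - lower) // step)  # ceiling division
    return lower + steps_needed * step


print(roundup_power2_division(1200, 4))  # 1280
```

With a single division the function rounds to the next power of two, so 1200 becomes 2048.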

Note

Some stats reported by the CUDA memory management API are specific to backend:native, and are not meaningful with backend:cudaMallocAsync. See each function's docstring for details.
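As a small sketch of the option format, the snippet below assembles a PYTORCH_CUDA_ALLOC_CONF value from a dictionary. The helper format_alloc_conf is a hypothetical name, and the option values (128 and 0.8) are illustrative rather than recommendations. It is assumed here that the variable should be in place before the process starts using CUDA, so it is best set before any CUDA work begins.

```python
import os


def format_alloc_conf(options: dict) -> str:
    """Join options into the <option>:<value>,<option2>:<value2> format."""
    return ",".join(f"{name}:{value}" for name, value in options.items())


conf = format_alloc_conf({
    "max_split_size_mb": 128,             # illustrative value
    "garbage_collection_threshold": 0.8,  # example threshold from the option description
})
# Assumption: set this before the process performs any CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = conf
print(conf)  # max_split_size_mb:128,garbage_collection_threshold:0.8
```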

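Using custom memory allocators for CUDA¶

It is possible to define allocators as simple functions in C/C++ and compile them as a shared library. The code below shows a basic allocator that just traces all the memory operations.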

#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>
// Compile with g++ alloc.cc -o alloc.so -I/usr/local/cuda/include -shared -fPIC
extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
   void *ptr;
   cudaMalloc(&ptr, size);
   std::cout<<"alloc "<<ptr<<" "<<size<<std::endl;
   return ptr;
}

void my_free(void* ptr, ssize_t size, int device, cudaStream_t stream) {
   std::cout<<"free "<<ptr<< " "<<stream<<std::endl;
   cudaFree(ptr);
}
}

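This can be used in Python through torch.cuda.memory.CUDAPluggableAllocator. The user is responsible for supplying the path to the .so file and the names of the alloc/free functions, which must match the signatures shown above.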

import torch

# Load the allocator
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
    'alloc.so', 'my_malloc', 'my_free')
# Swap the current allocator
torch.cuda.memory.change_current_allocator(new_alloc)
# This will allocate memory in the device using the new allocator
b = torch.zeros(10, device='cuda')

import torch

# Do an initial memory allocator
b = torch.zeros(10, device='cuda')
# Load the allocator
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
    'alloc.so', 'my_malloc', 'my_free')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(new_alloc)

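cuBLAS workspaces¶

For each combination of cuBLAS handle and CUDA stream, a cuBLAS workspace will be allocated if that handle and stream combination executes a cuBLAS kernel that requires a workspace. In order to avoid repeatedly allocating workspaces, these workspaces are not deallocated unless torch._C._cuda_clearCublasWorkspaces() is called. The workspace size per allocation can be specified via the environment variable CUBLAS_WORKSPACE_CONFIG with the format :[SIZE]:[COUNT]. As an example, the default workspace size per allocation is CUBLAS_WORKSPACE_CONFIG=:4096:2:16:8, which specifies a total size of 2 * 4096 + 8 * 16 KiB. To force cuBLAS to avoid using workspaces, set CUBLAS_WORKSPACE_CONFIG=:0:0.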
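The total size implied by a CUBLAS_WORKSPACE_CONFIG value can be checked with a little arithmetic. The parser below is a hypothetical helper written for this illustration (it is not part of PyTorch or cuBLAS) and assumes the value is a sequence of :[SIZE]:[COUNT] pairs with sizes in KiB, as described above.

```python
def cublas_workspace_kib(config: str) -> int:
    """Total workspace size in KiB for a ':SIZE:COUNT[:SIZE:COUNT...]' string."""
    fields = [int(field) for field in config.strip(":").split(":")]
    if len(fields) % 2 != 0:
        raise ValueError("expected SIZE:COUNT pairs")
    sizes, counts = fields[0::2], fields[1::2]
    return sum(size * count for size, count in zip(sizes, counts))


print(cublas_workspace_kib(":4096:2:16:8"))  # 8320
print(cublas_workspace_kib(":0:0"))          # 0
```

The default configuration therefore amounts to 8320 KiB, i.e. 2 * 4096 + 8 * 16.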

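cuFFT plan cache¶

For each CUDA device, an LRU cache of cuFFT plans is used to speed up repeatedly running FFT methods (e.g., torch.fft.fft()) on CUDA tensors of the same geometry with the same configuration. Because some cuFFT plans may allocate GPU memory, these caches have a maximum capacity.

You may control and query the properties of the cache of the current device with the following APIs:

torch.backends.cuda.cufft_plan_cache.max_size gives the capacity of the cache (the default is 4096 on CUDA 10 and newer, and 1023 on older CUDA versions). Setting this value directly modifies the capacity.
torch.backends.cuda.cufft_plan_cache.size gives the number of plans currently residing in the cache.
torch.backends.cuda.cufft_plan_cache.clear() clears the cache.

To control and query plan caches of a non-default device, you can index the torch.backends.cuda.cufft_plan_cache object with either a torch.device object or a device index, and access one of the above attributes. E.g., to set the capacity of the cache for device 1, one can write torch.backends.cuda.cufft_plan_cache[1].max_size = 10.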

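Just-in-Time Compilation¶

PyTorch just-in-time compiles some operations, like torch.special.zeta, when performed on CUDA tensors. This compilation can be time consuming (up to a few seconds depending on your hardware and software) and may occur multiple times for a single operator, since many PyTorch operators actually select from a variety of kernels, each of which must be compiled once, depending on their input. This compilation occurs once per process, or just once if a kernel cache is used.

By default, PyTorch creates a kernel cache in $XDG_CACHE_HOME/torch/kernels if XDG_CACHE_HOME is defined and in $HOME/.cache/torch/kernels if it is not (except on Windows, where the kernel cache is not yet supported). The caching behavior can be directly controlled with two environment variables. If USE_PYTORCH_KERNEL_CACHE is set to 0 then no cache will be used, and if PYTORCH_KERNEL_CACHE_PATH is set then that path will be used as a kernel cache instead of the default location.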
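The lookup rules in the paragraph above can be summarized as a small function. This is a sketch of the described behavior, not PyTorch's implementation: the name kernel_cache_dir is hypothetical, the Windows exception is not modeled, and giving USE_PYTORCH_KERNEL_CACHE=0 priority over PYTORCH_KERNEL_CACHE_PATH is an assumption.

```python
import posixpath
from typing import Mapping, Optional


def kernel_cache_dir(env: Mapping[str, str]) -> Optional[str]:
    """Return the kernel cache directory implied by env, or None if disabled."""
    if env.get("USE_PYTORCH_KERNEL_CACHE") == "0":
        return None  # caching disabled (assumed to take priority)
    if env.get("PYTORCH_KERNEL_CACHE_PATH"):
        return env["PYTORCH_KERNEL_CACHE_PATH"]  # explicit override
    if env.get("XDG_CACHE_HOME"):
        return posixpath.join(env["XDG_CACHE_HOME"], "torch", "kernels")
    if env.get("HOME"):
        return posixpath.join(env["HOME"], ".cache", "torch", "kernels")
    return None


print(kernel_cache_dir({"HOME": "/home/user"}))
# /home/user/.cache/torch/kernels
```

Passing os.environ as the argument evaluates the same rules against the current process environment.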

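Best practices¶

Device-agnostic code¶

Due to the structure of PyTorch, you may need to explicitly write device-agnostic (CPU or GPU) code; an example may be creating a new tensor as the initial hidden state of a recurrent neural network.

The first step is to determine whether the GPU should be used or not. A common pattern is to use Python's argparse module to read in user arguments, and have a flag that can be used to disable CUDA, in combination with is_available(). In the following, args.device results in a torch.device object that can be used to move tensors to CPU or CUDA.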

import argparse
import torch

parser = argparse.ArgumentParser(description='PyTorch Example')
parser.add_argument('--disable-cuda', action='store_true',
                    help='Disable CUDA')
args = parser.parse_args()
args.device = None
if not args.disable_cuda and torch.cuda.is_available():
    args.device = torch.device('cuda')
else:
    args.device = torch.device('cpu')

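Note

When assessing the availability of CUDA in a given environment (is_available()), PyTorch's default behavior is to call the CUDA Runtime API method cudaGetDeviceCount. Because this call in turn initializes the CUDA Driver API (via cuInit) if it is not already initialized, subsequent forks of a process that has run is_available() will fail with a CUDA initialization error.

You can set PYTORCH_NVML_BASED_CUDA_CHECK=1 in your environment before importing PyTorch modules that execute is_available() (or before executing it directly) in order to direct is_available() to attempt an NVML-based assessment (nvmlDeviceGetCount_v2). If the NVML-based assessment is successful (i.e. NVML discovery/initialization does not fail), is_available() calls will not poison subsequent forks.

If NVML discovery/initialization fails, is_available() will fall back to the standard CUDA Runtime API assessment and the aforementioned fork constraint will apply.

Note that the above NVML-based CUDA availability assessment provides a weaker guarantee than the default CUDA Runtime API approach (which requires CUDA initialization to succeed). In some circumstances, the NVML-based check may succeed while later CUDA initialization fails.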
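A minimal sketch of opting into the NVML-based check from Python follows. It assumes the variable is set in the process environment before torch is imported; exporting it in the shell before launching the program has the same effect.

```python
import os

# Request the NVML-based availability check; this must happen before
# is_available() is first evaluated, so it is done before importing torch.
os.environ["PYTORCH_NVML_BASED_CUDA_CHECK"] = "1"

import torch  # noqa: E402  (deliberately imported after setting the variable)

cuda_ok = torch.cuda.is_available()
print(cuda_ok)
```

As noted above, a True result here is a weaker guarantee than the default check, so code that later initializes CUDA should still be prepared for that initialization to fail.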

Now that we have args.device, we can use it to create a tensor on the desired device.

x = torch.empty((8, 42), device=args.device)
net = Network().to(device=args.device)

This can be used in a number of cases to produce device-agnostic code. Below is an example when using a dataloader:

cuda0 = torch.device('cuda:0')  # CUDA GPU 0
for i, x in enumerate(train_loader):
    x = x.to(cuda0)

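When working with multiple GPUs on a system, you can use the CUDA_VISIBLE_DEVICES environment flag to manage which GPUs are available to PyTorch. As mentioned above, to manually control which GPU a tensor is created on, the best practice is to use a torch.cuda.device context manager.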

print("Outside device is 0")  # On device 0 (default in most scenarios)
with torch.cuda.device(1):
    print("Inside device is 1")  # On device 1
print("Outside device is still 0")  # On device 0

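If you have a tensor and would like to create a new tensor of the same type on the same device, then you can use a torch.Tensor.new_* method (see torch.Tensor). Whilst the previously mentioned torch.* factory functions (creation ops) depend on the current GPU context and the attributes arguments you pass in, torch.Tensor.new_* methods preserve the device and other attributes of the tensor.

This is the recommended practice when creating modules in which new tensors need to be created internally during the forward pass.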

cuda = torch.device('cuda')
x_cpu = torch.empty(2)
x_gpu = torch.empty(2, device=cuda)
x_cpu_long = torch.empty(2, dtype=torch.int64)

y_cpu = x_cpu.new_full([3, 2], fill_value=0.3)
print(y_cpu)

    tensor([[ 0.3000,  0.3000],
            [ 0.3000,  0.3000],
            [ 0.3000,  0.3000]])

y_gpu = x_gpu.new_full([3, 2], fill_value=-5)
print(y_gpu)

    tensor([[-5.0000, -5.0000],
            [-5.0000, -5.0000],
            [-5.0000, -5.0000]], device='cuda:0')

y_cpu_long = x_cpu_long.new_tensor([[1, 2, 3]])
print(y_cpu_long)

    tensor([[ 1,  2,  3]])

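If you want to create a tensor of the same type and size as another tensor, and fill it with either ones or zeros, ones_like() or zeros_like() are provided as convenient helper functions (which also preserve the torch.device and torch.dtype of a Tensor).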

x_cpu = torch.empty(2, 3)
x_gpu = torch.empty(2, 3, device='cuda')

y_cpu = torch.ones_like(x_cpu)
y_gpu = torch.zeros_like(x_gpu)

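Use pinned memory buffers¶

Warning

This is an advanced tip. If you overuse pinned memory, it can cause serious problems when running low on RAM, and you should be aware that pinning is often an expensive operation.

Host to GPU copies are much faster when they originate from pinned (page-locked) memory. CPU tensors and storages expose a pin_memory() method that returns a copy of the object, with data put in a pinned region.

Also, once you pin a tensor or storage, you can use asynchronous GPU copies. Just pass an additional non_blocking=True argument to a to() or a cuda() call. This can be used to overlap data transfers with computation.

You can make the DataLoader return batches placed in pinned memory by passing pin_memory=True to its constructor.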
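The pinning and asynchronous-copy steps described in this section can be sketched as follows. The helper copy_to_device is a hypothetical name, and the example falls back to a plain copy on machines without CUDA. Pinning on every call, as done here for brevity, is itself costly; in practice the buffer would usually be pinned once or by the DataLoader.

```python
import torch


def copy_to_device(batch: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Copy a CPU tensor to device, using pinned memory when targeting CUDA."""
    if device.type == "cuda":
        pinned = batch.pin_memory()                   # page-locked host copy
        return pinned.to(device, non_blocking=True)   # asynchronous host-to-GPU copy
    return batch.to(device)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cpu_batch = torch.arange(6, dtype=torch.float32).reshape(2, 3)
out = copy_to_device(cpu_batch, device)

if device.type == "cuda":
    # Wait for the asynchronous copy before relying on its completion from the host.
    torch.cuda.current_stream().synchronize()
print(out.device)
```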

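Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel¶

Most use cases involving batched inputs and multiple GPUs should default to using DistributedDataParallel to utilize more than one GPU.

There are significant caveats to using CUDA models with multiprocessing; unless care is taken to meet the data handling requirements exactly, it is likely that your program will have incorrect or undefined behavior.

It is recommended to use DistributedDataParallel, instead of DataParallel, to do multi-GPU training, even if there is only a single node.

The difference between DistributedDataParallel and DataParallel is: DistributedDataParallel uses multiprocessing, where a process is created for each GPU, while DataParallel uses multithreading. By using multiprocessing, each GPU has its dedicated process, which avoids the performance overhead caused by the GIL of the Python interpreter.

If you use DistributedDataParallel, you can use the torch.distributed.launch utility to launch your program; see Third-party backends.

CUDA Graphs¶

A CUDA graph is a record of the work (mostly kernels and their arguments) that a CUDA stream and its dependent streams perform. For general principles and details on the underlying CUDA API, see Getting Started with CUDA Graphs and the Graphs section of the CUDA C Programming Guide.

PyTorch supports the construction of CUDA graphs using stream capture, which puts a CUDA stream in capture mode. CUDA work issued to a capturing stream doesn't actually run on the GPU. Instead, the work is recorded in a graph.

After capture, the graph can be launched to run the GPU work as many times as needed. Each replay runs the same kernels with the same arguments. For pointer arguments this means the same memory addresses are used. By filling input memory with new data (e.g., from a new batch) before each replay, you can rerun the same work on new data.

Why CUDA Graphs?¶

Replaying a graph sacrifices the dynamic flexibility of typical eager execution in exchange for greatly reduced CPU overhead. A graph's arguments and kernels are fixed, so a graph replay skips all layers of argument setup and kernel dispatch, including Python, C++, and CUDA driver overheads. Under the hood, a replay submits the entire graph's work to the GPU with a single call to cudaGraphLaunch. Kernels in a replay also execute slightly faster on the GPU, but eliding CPU overhead is the main benefit.

You should try CUDA graphs if all or part of your network is graph-safe (usually this means static shapes and static control flow, but see the other constraints) and you suspect its runtime is at least somewhat CPU-limited.

PyTorch API¶

Warning

This API is in beta and may change in future releases.

PyTorch exposes graphs via a raw torch.cuda.CUDAGraph class and two convenience wrappers, torch.cuda.graph and torch.cuda.make_graphed_callables.

torch.cuda.graph is a simple, versatile context manager that captures CUDA work in its context. Before capture, warm up the workload to be captured by running a few eager iterations. Warmup must occur on a side stream. Because the graph reads from and writes to the same memory addresses in every replay, you must maintain long-lived references to tensors that hold input and output data during capture. To run the graph on new input data, copy new data to the capture's input tensor(s), replay the graph, then read the new output from the capture's output tensor(s). Example: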

g = torch.cuda.CUDAGraph()

# Placeholder input used for capture
static_input = torch.empty((5,), device="cuda")

# Warmup before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = static_input * 2
torch.cuda.current_stream().wait_stream(s)

# Captures the graph
# To allow capture, automatically sets a side stream as the current stream in the context
with torch.cuda.graph(g):
    static_output = static_input * 2

# Fills the graph's input memory with new data to compute on
static_input.copy_(torch.full((5,), 3, device="cuda"))
g.replay()
# static_output holds the results
print(static_output)  # full of 3 * 2 = 6

# Fills the graph's input memory with more data to compute on
static_input.copy_(torch.full((5,), 4, device="cuda"))
g.replay()
print(static_output)  # full of 4 * 2 = 8

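See Whole-network capture, Usage with torch.cuda.amp, and Usage with multiple streams for realistic and advanced patterns.

make_graphed_callables is more sophisticated. make_graphed_callables accepts Python functions and torch.nn.Modules. For each passed function or Module, it creates separate graphs of the forward-pass and backward-pass work. See Partial-network capture.

Constraints¶

A set of ops is capturable if it doesn't violate any of the following constraints.

Constraints apply to all work in a torch.cuda.graph context and all work in the forward and backward passes of any callable you pass to torch.cuda.make_graphed_callables().

Violating any of these will likely cause a runtime error:

Capture must occur on a non-default stream. (This is only a concern if you use the raw CUDAGraph.capture_begin and CUDAGraph.capture_end calls. graph and make_graphed_callables() set a side stream for you.)
Ops that synchronize the CPU with the GPU (e.g., .item() calls) are prohibited.
CUDA RNG operations are permitted, and when using multiple torch.Generator instances within a graph, they must be registered using CUDAGraph.register_generator_state before graph capture. Avoid using Generator.get_state and Generator.set_state during capture; instead, use Generator.graphsafe_set_state and Generator.graphsafe_get_state to manage generator states safely within the graph context. This ensures proper RNG operation and generator management within CUDA graphs.

Violating any of these will likely cause silent numerical errors or undefined behavior:

Within a process, only one capture may be underway at a time.
No non-captured CUDA work may run in this process (on any thread) while capture is underway.
CPU work is not captured. If the captured ops include CPU work, that work will be elided during replay.
Every replay reads from and writes to the same (virtual) memory addresses.
Dynamic control flow (based on CPU or GPU data) is prohibited.
Dynamic shapes are prohibited. The graph assumes every tensor in the captured op sequence has the same size and layout in every replay.
Using multiple streams in a capture is allowed, but there are restrictions.

Non-constraints¶

Once captured, the graph may be replayed on any stream.

Whole-network capture¶

If your entire network is capturable, you can capture and replay an entire iteration: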

N, D_in, H, D_out = 640, 4096, 2048, 1024
model = torch.nn.Sequential(torch.nn.Linear(D_in, H),
                            torch.nn.Dropout(p=0.2),
                            torch.nn.Linear(H, D_out),
                            torch.nn.Dropout(p=0.1)).cuda()
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Placeholders used for capture
static_input = torch.randn(N, D_in, device='cuda')
static_target = torch.randn(N, D_out, device='cuda')

# warmup
# Uses static_input and static_target here for convenience,
# but in a real setting, because the warmup includes optimizer.step()
# you must use a few batches of real data.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        optimizer.zero_grad(set_to_none=True)
        y_pred = model(static_input)
        loss = loss_fn(y_pred, static_target)
        loss.backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# capture
g = torch.cuda.CUDAGraph()
# Sets grads to None before capture, so backward() will create
# .grad attributes with allocations from the graph's private pool
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_y_pred = model(static_input)
    static_loss = loss_fn(static_y_pred, static_target)
    static_loss.backward()
    optimizer.step()

real_inputs = [torch.rand_like(static_input) for _ in range(10)]
real_targets = [torch.rand_like(static_target) for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    # Fills the graph's input memory with new data to compute on
    static_input.copy_(data)
    static_target.copy_(target)
    # replay() includes forward, backward, and step.
    # You don't even need to call optimizer.zero_grad() between iterations
    # because the captured backward refills static .grad tensors in place.
    g.replay()
    # Params have been updated. static_y_pred, static_loss, and .grad
    # attributes hold values from computing on this iteration's data.

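Partial-network capture¶

If some of your network is unsafe to capture (e.g., due to dynamic control flow, dynamic shapes, CPU syncs, or essential CPU-side logic), you can run the unsafe part(s) eagerly and use torch.cuda.make_graphed_callables() to graph only the capture-safe part(s).

By default, callables returned by make_graphed_callables() are autograd-aware, and can be used in the training loop as direct replacements for the functions or nn.Modules you passed.

make_graphed_callables() internally creates CUDAGraph objects, runs warmup iterations, and maintains static inputs and outputs as needed. Therefore (unlike with torch.cuda.graph) you don't need to handle those manually.

In the following example, data-dependent dynamic control flow means the network isn't capturable end-to-end, but make_graphed_callables() lets us capture and run graph-safe sections as graphs regardless: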

from itertools import chain  # used below to combine the modules' parameters

N, D_in, H, D_out = 640, 4096, 2048, 1024

module1 = torch.nn.Linear(D_in, H).cuda()
module2 = torch.nn.Linear(H, D_out).cuda()
module3 = torch.nn.Linear(H, D_out).cuda()

loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(chain(module1.parameters(),
                                  module2.parameters(),
                                  module3.parameters()),
                            lr=0.1)

# Sample inputs used for capture
# requires_grad state of sample inputs must match
# requires_grad state of real inputs each callable will see.
x = torch.randn(N, D_in, device='cuda')
h = torch.randn(N, H, device='cuda', requires_grad=True)

module1 = torch.cuda.make_graphed_callables(module1, (x,))
module2 = torch.cuda.make_graphed_callables(module2, (h,))
module3 = torch.cuda.make_graphed_callables(module3, (h,))

real_inputs = [torch.rand_like(x) for _ in range(10)]
real_targets = [torch.randn(N, D_out, device="cuda") for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    optimizer.zero_grad(set_to_none=True)

    tmp = module1(data)  # forward ops run as a graph

    if tmp.sum().item() > 0:
        tmp = module2(tmp)  # forward ops run as a graph
    else:
        tmp = module3(tmp)  # forward ops run as a graph

    loss = loss_fn(tmp, target)
    # module2's or module3's (whichever was chosen) backward ops,
    # as well as module1's backward ops, run as graphs
    loss.backward()
    optimizer.step()

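Usage with torch.cuda.amp¶

For typical optimizers, GradScaler.step syncs the CPU with the GPU, which is prohibited during capture. To avoid errors, either use partial-network capture, or (if forward, loss, and backward are capture-safe) capture forward, loss, and backward but not the optimizer step: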

# warmup
# In a real setting, use a few batches of real data.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            y_pred = model(static_input)
            loss = loss_fn(y_pred, static_target)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
torch.cuda.current_stream().wait_stream(s)

# capture
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    with torch.cuda.amp.autocast():
        static_y_pred = model(static_input)
        static_loss = loss_fn(static_y_pred, static_target)
    scaler.scale(static_loss).backward()
    # don't capture scaler.step(optimizer) or scaler.update()

real_inputs = [torch.rand_like(static_input) for _ in range(10)]
real_targets = [torch.rand_like(static_target) for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    static_input.copy_(data)
    static_target.copy_(target)
    g.replay()
    # Runs scaler.step and scaler.update eagerly
    scaler.step(optimizer)
    scaler.update()

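Usage with multiple streams¶

Capture mode automatically propagates to any streams that sync with a capturing stream. Within capture, you may expose parallelism by issuing calls to different streams, but the overall stream dependency DAG must branch out from the initial capturing stream after capture begins and rejoin the initial stream before capture ends: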

with torch.cuda.graph(g):
    # at context manager entrance, torch.cuda.current_stream()
    # is the initial capturing stream

    # INCORRECT (does not branch out from or rejoin initial stream)
    with torch.cuda.stream(s):
        cuda_work()

    # CORRECT:
    # branches out from initial stream
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        cuda_work()
    # rejoins initial stream before capture ends
    torch.cuda.current_stream().wait_stream(s)

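Note

To avoid confusion for power users looking at replays in nsight systems or nvprof: unlike eager execution, the graph interprets a nontrivial stream DAG in capture as a hint, not a command. During replay, the graph may reorganize independent ops onto different streams or enqueue them in a different order (while respecting your original DAG's overall dependencies).

Usage with DistributedDataParallel¶

NCCL < 2.9.6¶

NCCL versions earlier than 2.9.6 don't allow collectives to be captured. You must use partial-network capture, which defers allreduces to happen outside graphed sections of backward.

Call make_graphed_callables() on graphable network sections before wrapping the network with DDP.

NCCL >= 2.9.6¶

NCCL versions 2.9.6 or later allow collectives in the graph. Approaches that capture an entire backward pass are a viable option, but need three setup steps.

Disable DDP's internal async error handling: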

os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "0"
torch.distributed.init_process_group(...)

Before full-backward capture, DDP must be constructed in a side-stream context:

with torch.cuda.stream(s):
    model = DistributedDataParallel(model)

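Your warmup must run at least 11 DDP-enabled eager iterations before capture.

Graph memory management¶

A captured graph acts on the same virtual addresses every time it replays. If PyTorch frees the memory, a later replay can hit an illegal memory access. If PyTorch reassigns the memory to new tensors, the replay can corrupt the values seen by those tensors. Therefore, the virtual addresses used by the graph must be reserved for the graph across replays. The PyTorch caching allocator achieves this by detecting when capture is underway and satisfying the capture's allocations from a graph-private memory pool. The private pool stays alive until its CUDAGraph object and all tensors created during capture go out of scope.

Private pools are maintained automatically. By default, the allocator creates a separate private pool for each capture. If you capture multiple graphs, this conservative approach ensures graph replays never corrupt each other's values, but sometimes needlessly wastes memory.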
