CUDA Automatic Mixed Precision Examples ¶

Ordinarily, "automatic mixed precision training" means training with torch.autocast and torch.cuda.amp.GradScaler together.

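Instances of torch.autocast enable autocasting for chosen regions. Autocasting automatically chooses the precision for GPU operations to improve performance while maintaining accuracy.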

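Instances of torch.cuda.amp.GradScaler help perform the steps of gradient scaling conveniently. Gradient scaling improves convergence for networks with float16 gradients by minimizing gradient underflow, as explained in the Gradient Scaling documentation.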
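To make the underflow problem concrete, here is a minimal standard-library sketch. It does not use PyTorch: the helper to_float16, the gradient value, and the scale factor are all illustrative assumptions. It rounds a value through IEEE 754 half precision with and without scaling.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


tiny_grad = 1e-8   # hypothetical gradient, below float16's smallest subnormal (~6e-8)
scale = 65536.0    # illustrative scale factor (2**16)

# Stored directly in float16, the gradient underflows to zero.
unscaled_in_half = to_float16(tiny_grad)

# Stored after scaling, it lands in float16's representable range.
scaled_in_half = to_float16(tiny_grad * scale)

# Unscaling in higher precision recovers the value to within float16 rounding error.
recovered = scaled_in_half / scale

print(unscaled_in_half)  # 0.0
print(recovered)
```

Because the loss is multiplied by the scale before backward(), the chain rule multiplies every gradient by the same factor, which is why a single division recovers the unscaled values.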

torch.autocast and torch.cuda.amp.GradScaler are modular. In the samples below, each is used as its individual documentation suggests.

(Samples here are illustrative. See the Automatic Mixed Precision recipe for a runnable walkthrough.)

Typical Mixed Precision Training ¶

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)

        # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()

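All gradients produced by scaler.scale(loss).backward() are scaled. If you wish to modify or inspect the parameters' .grad attributes between backward() and scaler.step(optimizer), you should unscale them first. For example, gradient clipping manipulates a set of gradients such that their global norm (see torch.nn.utils.clip_grad_norm_()) or maximum magnitude (see torch.nn.utils.clip_grad_value_()) is <= some user-imposed threshold. If you attempted to clip without unscaling, the gradients' norm/maximum magnitude would also be scaled, so your requested threshold (which was meant to be the threshold for unscaled gradients) would be invalid.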
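The effect of clipping before unscaling can be checked with plain arithmetic. The following is a minimal pure-Python sketch, not PyTorch code: clip_by_global_norm is a hypothetical helper that imitates norm-based clipping on a list of floats, and the gradient values and scale factor are made up for illustration.

```python
import math


def l2_norm(grads):
    """Global L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so that their global L2 norm is at most max_norm."""
    coef = min(1.0, max_norm / (l2_norm(grads) + 1e-6))
    return [g * coef for g in grads]


scale = 1024.0            # illustrative loss scale
max_norm = 1.0            # threshold intended for the unscaled gradients
true_grads = [3.0, 4.0]   # hypothetical unscaled gradients, norm 5.0
scaled_grads = [g * scale for g in true_grads]

# Clipping while still scaled, then unscaling: the threshold shrinks by the scale.
wrong = [g / scale for g in clip_by_global_norm(scaled_grads, max_norm)]

# Unscaling first, then clipping: the threshold applies as intended.
right = clip_by_global_norm([g / scale for g in scaled_grads], max_norm)

print(l2_norm(wrong))   # close to max_norm / scale
print(l2_norm(right))   # close to max_norm
```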

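scaler.unscale_(optimizer) unscales gradients held by optimizer's assigned parameters. If your model or models contain other parameters that were assigned to another optimizer (say optimizer2), you may call scaler.unscale_(optimizer2) separately to unscale those parameters' gradients as well.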

Gradient clipping ¶

Calling scaler.unscale_(optimizer) before clipping enables you to clip unscaled gradients as usual:

scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

        # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
        # although it still skips optimizer.step() if the gradients contain infs or NaNs.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()

scaler records that scaler.unscale_(optimizer) was already called for this optimizer this iteration, so scaler.step(optimizer) knows not to redundantly unscale gradients before (internally) calling optimizer.step().

Warning

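unscale_ should only be called once per optimizer per step call, and only after all gradients for that optimizer's assigned parameters have been accumulated. Calling unscale_ twice for a given optimizer between each step triggers a RuntimeError.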

Working with Scaled Gradients ¶

Gradient accumulation ¶

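Gradient accumulation adds gradients over an effective batch of size batch_per_iter * iters_to_accumulate (* num_procs if distributed). The scale should be calibrated for the effective batch, which means that inf/NaN checking, step skipping if inf/NaN grads are found, and scale updates should all occur at effective-batch granularity. Also, grads should remain scaled, and the scale factor should remain constant, while grads for a given effective batch are accumulated. If grads are unscaled (or the scale factor changes) before accumulation is complete, the next backward pass will add scaled grads to unscaled grads (or to grads scaled by a different factor), after which it is impossible to recover the accumulated unscaled grads that step must apply.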

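Therefore, if you want to unscale_ grads (e.g., to allow clipping unscaled grads), call unscale_ just before step, after all (scaled) grads for the upcoming step have been accumulated. Also, only call update at the end of iterations where you called step for a full effective batch: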

scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0:
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)

            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
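
The requirement that grads stay scaled until accumulation finishes can be illustrated with plain arithmetic. This is a minimal sketch with made-up numbers for a single parameter; it does not call PyTorch, and all names are illustrative.

```python
scale = 1024.0
# Hypothetical gradients of one parameter from four micro-batches
# (already divided by iters_to_accumulate).
micro_grads = [0.25, 0.5, 0.125, 0.125]

# Correct: every contribution stays scaled until accumulation is complete.
accumulated = 0.0
for g in micro_grads:
    accumulated += g * scale
correct = accumulated / scale   # unscale once, just before the step

# Incorrect: unscale after the second micro-batch, then keep accumulating.
accumulated = 0.0
for i, g in enumerate(micro_grads):
    accumulated += g * scale
    if i == 1:
        accumulated /= scale    # premature unscale
mixed = accumulated             # (0.25 + 0.5) + (0.125 + 0.125) * scale

print(correct)  # 1.0
print(mixed)    # 256.75
```

Neither mixed nor mixed / scale equals the true sum, because the buffer now holds terms scaled by two different factors.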

Gradient penalty ¶

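A gradient penalty implementation commonly creates gradients using torch.autograd.grad(), combines them to create the penalty value, and adds the penalty value to the loss.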

Here is an ordinary example of an L2 penalty without gradient scaling or autocasting:

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)

        # Creates gradients
        grad_params = torch.autograd.grad(outputs=loss,
                                          inputs=model.parameters(),
                                          create_graph=True)

        # Computes the penalty term and adds it to the loss
        grad_norm = 0
        for grad in grad_params:
            grad_norm += grad.pow(2).sum()
        grad_norm = grad_norm.sqrt()
        loss = loss + grad_norm

        loss.backward()

        # clip gradients here, if desired

        optimizer.step()

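To implement a gradient penalty with gradient scaling, the outputs Tensor(s) passed to torch.autograd.grad() should be scaled. The resulting gradients will therefore be scaled, and should be unscaled before being combined to create the penalty value.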

Also, the penalty term computation is part of the forward pass, and therefore should be inside an autocast context.
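
The need to unscale before combining can be written out as a short derivation. The notation is an assumption introduced here for illustration: S is the current scale factor, L the unscaled loss, theta_i the parameters, g_i the unscaled gradients, and g-tilde_i the gradients returned by torch.autograd.grad() on the scaled loss.

```latex
\begin{aligned}
\tilde{g}_i &= \nabla_{\theta_i}\,(S \cdot L) = S\, g_i,
  \qquad g_i = \nabla_{\theta_i} L \\
g_i &= \tilde{g}_i / S \\
L_{\text{total}} &= L + \sqrt{\textstyle\sum_i \lVert g_i \rVert_2^2} \\
\sqrt{\textstyle\sum_i \lVert \tilde{g}_i \rVert_2^2}
  &= S \sqrt{\textstyle\sum_i \lVert g_i \rVert_2^2}
\end{aligned}
```

The last line shows what would go wrong without unscaling: the penalty would be S times too large, so its weight relative to L would change whenever the scale changes.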

Here is how that looks for the same L2 penalty:

scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)

        # Scales the loss for autograd.grad's backward pass, producing scaled_grad_params
        scaled_grad_params = torch.autograd.grad(outputs=scaler.scale(loss),
                                                 inputs=model.parameters(),
                                                 create_graph=True)

        # Creates unscaled grad_params before computing the penalty. scaled_grad_params are
        # not owned by any optimizer, so ordinary division is used instead of scaler.unscale_:
        inv_scale = 1./scaler.get_scale()
        grad_params = [p * inv_scale for p in scaled_grad_params]

        # Computes the penalty term and adds it to the loss
        with autocast(device_type='cuda', dtype=torch.float16):
            grad_norm = 0
            for grad in grad_params:
                grad_norm += grad.pow(2).sum()
            grad_norm = grad_norm.sqrt()
            loss = loss + grad_norm

        # Applies scaling to the backward call as usual.
        # Accumulates leaf gradients that are correctly scaled.
        scaler.scale(loss).backward()

        # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)

        # step() and update() proceed as usual.
        scaler.step(optimizer)
        scaler.update()

Working with Multiple Models, Losses, and Optimizers ¶

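If your network has multiple losses, you must call scaler.scale on each of them individually. If your network has multiple optimizers, you may call scaler.unscale_ on any of them individually, and you must call scaler.step on each of them individually.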

However, scaler.update should only be called once, after all optimizers used this iteration have been stepped:

scaler = torch.cuda.amp.GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer0.zero_grad()
        optimizer1.zero_grad()
        with autocast(device_type='cuda', dtype=torch.float16):
            output0 = model0(input)
            output1 = model1(input)
            loss0 = loss_fn(2 * output0 + 3 * output1, target)
            loss1 = loss_fn(3 * output0 - 5 * output1, target)

        # (retain_graph here is unrelated to amp, it's present because in this
        # example, both backward() calls share some sections of graph.)
        scaler.scale(loss0).backward(retain_graph=True)
        scaler.scale(loss1).backward()

        # You can choose which optimizers receive explicit unscaling, if you
        # want to inspect or modify the gradients of the params they own.
        scaler.unscale_(optimizer0)

        scaler.step(optimizer0)
        scaler.step(optimizer1)

        scaler.update()

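Each optimizer checks its gradients for infs/NaNs and makes an independent decision whether or not to skip the step. This may result in one optimizer skipping the step while the other one does not. Since step skipping occurs rarely (every several hundred iterations), this should not impede convergence. If you observe poor convergence after adding gradient scaling to a multiple-optimizer model, please report a bug.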
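
The interplay between independent skip decisions and the single update call can be sketched in plain Python. This is a toy model, not GradScaler's implementation: toy_iteration and any_nonfinite are hypothetical helpers, gradients are plain lists of floats, and the default factors are assumed to mirror the defaults documented for the GradScaler constructor.

```python
import math


def any_nonfinite(grads):
    """True if any gradient value is inf or NaN."""
    return any(math.isinf(g) or math.isnan(g) for g in grads)


def toy_iteration(scale, growth_tracker, grads_per_optimizer,
                  growth_factor=2.0, backoff_factor=0.5, growth_interval=2000):
    """Return (per-optimizer stepped flags, new scale, new growth tracker)."""
    found = [any_nonfinite(g) for g in grads_per_optimizer]
    # Each optimizer decides independently whether to skip its step.
    stepped = [not f for f in found]
    # One update per iteration: back off if any optimizer skipped,
    # otherwise count toward the next growth.
    if any(found):
        scale *= backoff_factor
        growth_tracker = 0
    else:
        growth_tracker += 1
        if growth_tracker == growth_interval:
            scale *= growth_factor
            growth_tracker = 0
    return stepped, scale, growth_tracker


stepped, scale, tracker = toy_iteration(
    65536.0, 10, [[0.1, float("inf")], [0.2, -0.3]]
)
print(stepped, scale, tracker)  # [False, True] 32768.0 0
```

In this sketch the first optimizer skips and the second steps, yet both share the reduced scale in the next iteration.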

Working with Multiple GPUs ¶

The issues described here only affect autocast. GradScaler's usage is unchanged.

DataParallel in a single process ¶

Even though torch.nn.DataParallel spawns threads to run the forward pass on each device, the autocast state is propagated to each thread, so the following works:

model = MyModel()
dp_model = nn.DataParallel(model)

# Sets autocast in the main thread
with autocast(device_type='cuda', dtype=torch.float16):
    # dp_model's internal threads will autocast.
    output = dp_model(input)
    # loss_fn also autocast
    loss = loss_fn(output)

DistributedDataParallel, one GPU per process ¶

torch.nn.parallel.DistributedDataParallel’s documentation recommends one GPU per process for best performance. In this case, DistributedDataParallel does not spawn threads internally, so usages of autocast and GradScaler are not affected.

DistributedDataParallel, multiple GPUs per process ¶

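Here torch.nn.parallel.DistributedDataParallel may spawn a side thread to run the forward pass on each device, like torch.nn.DataParallel. The fix is to apply autocast as part of your model's forward method, which ensures it is enabled in side threads.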
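
Why entering the context inside forward helps can be illustrated with a standard-library analogy. The torch.autocast documentation describes autocast state as thread-local; the sketch below imitates that with threading.local. Everything in it (toy_autocast, is_enabled, the forward functions) is a hypothetical stand-in, not PyTorch API.

```python
import threading
from contextlib import contextmanager

_state = threading.local()  # stand-in for per-thread autocast state


def is_enabled():
    return getattr(_state, "enabled", False)


@contextmanager
def toy_autocast():
    """Enable the flag for the current thread only, restoring it on exit."""
    previous = is_enabled()
    _state.enabled = True
    try:
        yield
    finally:
        _state.enabled = previous


def forward_plain(results, key):
    # Relies on whatever state the calling thread happens to have.
    results[key] = is_enabled()


def forward_with_autocast(results, key):
    # Enters the context inside forward, so it is enabled in any thread.
    with toy_autocast():
        results[key] = is_enabled()


def run_in_side_thread(fn, results, key):
    worker = threading.Thread(target=fn, args=(results, key))
    worker.start()
    worker.join()


results = {}
with toy_autocast():  # enabled in the main thread only
    run_in_side_thread(forward_plain, results, "plain")
    run_in_side_thread(forward_with_autocast, results, "inside_forward")
    results["main"] = is_enabled()

print(results)
```

The side thread running forward_plain does not see the main thread's setting, while the one that enters the context inside forward does.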

Autocast and Custom Autograd Functions ¶

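If your network uses custom autograd functions (subclasses of torch.autograd.Function), changes are required for autocast compatibility if any function

takes multiple floating-point Tensor inputs,
wraps any autocastable op (see the Autocast Op Reference), or
requires a particular dtype (for example, if it wraps CUDA extensions that were only compiled for dtype).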

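In all cases, if you are importing the function and cannot alter its definition, a safe fallback is to disable autocast and force execution in float32 (or dtype) at any points of use where errors occur: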

with autocast(device_type='cuda', dtype=torch.float16):
    ...
    with autocast(device_type='cuda', dtype=torch.float16, enabled=False):
        output = imported_function(input1.float(), input2.float())

If you are the function's author (or can alter its definition), a better solution is to use the torch.cuda.amp.custom_fwd() and torch.cuda.amp.custom_bwd() decorators as shown in the relevant case below.

Functions with multiple inputs or autocastable ops ¶

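Apply custom_fwd and custom_bwd (with no arguments) to forward and backward respectively. These ensure forward executes with the current autocast state and backward executes with the same autocast state as forward (which can prevent type mismatch errors):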

class MyMM(torch.autograd.Function):
    @staticmethod
    @custom_fwd
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a.mm(b)
    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        a, b = ctx.saved_tensors
        return grad.mm(b.t()), a.t().mm(grad)

Now MyMM can be invoked anywhere, without disabling autocast or manually casting inputs:

mymm = MyMM.apply

with autocast(device_type='cuda', dtype=torch.float16):
    output = mymm(input1, input2)

Functions that need a particular dtype ¶

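Consider a custom function that requires torch.float32 inputs. Apply custom_fwd(cast_inputs=torch.float32) to forward and custom_bwd (with no arguments) to backward. If forward runs in an autocast-enabled region, the decorators cast floating-point CUDA Tensor inputs to float32, and locally disable autocast during forward and backward: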

class MyFloat32Func(torch.autograd.Function):
    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)
    def forward(ctx, input):
        ctx.save_for_backward(input)
        ...
        return fwd_output
    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        ...

Now MyFloat32Func can be invoked anywhere, without manually disabling autocast or casting inputs:

func = MyFloat32Func.apply

with autocast(device_type='cuda', dtype=torch.float16):
    # func will run in float32, regardless of the surrounding autocast state
    output = func(input)