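CUDA Automatic Mixed Precision examples

Ordinarily, "automatic mixed precision training" means training with torch.autocast and torch.cuda.amp.GradScaler together.

Instances of torch.autocast enable autocasting for chosen regions. Autocasting automatically chooses the precision for GPU operations to improve performance while maintaining accuracy.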

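Instances of torch.cuda.amp.GradScaler help perform the steps of gradient scaling conveniently. Gradient scaling improves convergence for networks with float16 gradients by minimizing gradient underflow (see the gradient scaling notes in the torch.amp documentation).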
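
To make the underflow problem concrete, the following minimal sketch casts a small gradient value to float16 with and without scaling. It runs on CPU and does not use GradScaler itself; the value `1e-8` and the variable names are illustrative assumptions, and the scale `2.0 ** 16` is chosen to match GradScaler's documented default `init_scale`.

```python
import torch

# A gradient value that is small but still meaningful in float32.
true_grad = torch.tensor(1e-8, dtype=torch.float32)

# Casting straight to float16 flushes the value to zero (underflow).
unscaled_fp16 = true_grad.to(torch.float16)

# Illustrative scale factor; a power of two keeps the multiplication exact.
scale = 2.0 ** 16

# Scaling first moves the value into float16's representable range.
scaled_fp16 = (true_grad * scale).to(torch.float16)

# Unscaling in float32 recovers the value up to float16 rounding error.
recovered = scaled_fp16.to(torch.float32) / scale

print(unscaled_fp16.item())
print(recovered.item())
```

The first printed value should be exactly zero, while the second should be close to the original `1e-8`. This is the effect gradient scaling relies on: information that would vanish in float16 survives when the loss (and therefore every gradient) is multiplied by a large factor before the backward pass.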

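torch.autocast and torch.cuda.amp.GradScaler are modular. In the samples below, each is used as its individual documentation suggests.

(Samples here are illustrative. See the Automatic Mixed Precision recipe for a runnable walkthrough.)

Typical Mixed Precision Training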

import torch
from torch import autocast, optim
from torch.cuda.amp import GradScaler

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)

        # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
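
For readers who want something they can execute directly, here is a small self-contained sketch of the same loop. The tiny linear model, the synthetic regression data, the learning rate, and the CPU fallback (which simply disables autocast and the scaler) are all assumptions made for illustration, not part of the sample above.

```python
import torch
from torch import nn, optim

# Assumption: fall back to CPU with AMP disabled when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"
amp_dtype = torch.float16 if use_amp else torch.bfloat16

torch.manual_seed(0)
model = nn.Linear(8, 1).to(device)  # hypothetical stand-in for Net()
optimizer = optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

# Synthetic data: 4 batches of 16 samples with a linear ground truth.
true_w = torch.randn(8, 1, device=device)
data = []
for _ in range(4):
    x = torch.randn(16, 8, device=device)
    data.append((x, x @ true_w))

def mean_loss():
    # Average loss over all batches, evaluated in default precision.
    with torch.no_grad():
        return sum(loss_fn(model(x), y).item() for x, y in data) / len(data)

initial_loss = mean_loss()
for epoch in range(20):
    for x, y in data:
        optimizer.zero_grad()
        with torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_amp):
            loss = loss_fn(model(x), y)
        scaler.scale(loss).backward()  # backward on the scaled loss
        scaler.step(optimizer)         # unscales, then steps unless infs/NaNs are found
        scaler.update()                # adjusts the scale for the next iteration
final_loss = mean_loss()
print(initial_loss, final_loss)
```

When the scaler is constructed with `enabled=False`, `scale` returns the loss unchanged and `step` calls `optimizer.step()` directly, so the same loop body can be kept for both the GPU and the CPU path.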

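All gradients produced by scaler.scale(loss).backward() are scaled. If you wish to modify or inspect the parameters' .grad attributes between backward() and scaler.step(optimizer), you should unscale them first. For example, gradient clipping manipulates a set of gradients such that their global norm (see torch.nn.utils.clip_grad_norm_()) or maximum magnitude (see torch.nn.utils.clip_grad_value_()) is $<=$ some user-imposed threshold. If you attempted to clip without unscaling, the gradients' norm/maximum magnitude would also be scaled, so your requested threshold (which was meant to be the threshold for unscaled gradients) would be invalid.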

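scaler.unscale_(optimizer) unscales gradients held by optimizer's assigned parameters. If your model or models contain other parameters that were assigned to another optimizer (say optimizer2), you may call scaler.unscale_(optimizer2) separately to unscale those parameters' gradients as well.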
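
To see why clipping scaled gradients makes the threshold invalid, the sketch below clips the same gradient twice: once after unscaling and once before. It runs on CPU and emulates the scaling by hand; the scale factor, the gradient values, and the helper `clipped_norm` are illustrative assumptions rather than GradScaler internals.

```python
import torch

max_norm = 1.0
scale = 2.0 ** 16  # illustrative loss scale

# Unscaled gradient with norm 5.0 (a 3-4-5 triangle).
true_grad = torch.tensor([3.0, 4.0])

def clipped_norm(unscale_first: bool) -> float:
    p = torch.nn.Parameter(torch.zeros(2))
    # What backward() on the scaled loss would leave in .grad.
    p.grad = true_grad * scale
    if unscale_first:
        p.grad.div_(scale)  # conceptually what scaler.unscale_(optimizer) does
        torch.nn.utils.clip_grad_norm_([p], max_norm)
    else:
        torch.nn.utils.clip_grad_norm_([p], max_norm)
        p.grad.div_(scale)  # unscaling that would otherwise happen later, at step time
    return p.grad.norm().item()

correct = clipped_norm(unscale_first=True)
wrong = clipped_norm(unscale_first=False)
print(correct, wrong)
```

With unscaling first, the final gradient norm is close to `max_norm`. Without it, the clip is applied to the scaled gradient, so the gradient that the optimizer eventually sees has a norm of roughly `max_norm / scale`, far smaller than intended.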

Gradient clipping

Calling scaler.unscale_(optimizer) before clipping enables you to clip unscaled gradients as usual:

scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

        # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
        # although it still skips optimizer.step() if the gradients contain infs or NaNs.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()

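scaler records that scaler.unscale_(optimizer) was already called for this optimizer this iteration, so scaler.step(optimizer) knows not to redundantly unscale gradients before (internally) calling optimizer.step().

Warning

unscale_ should only be called once per optimizer per step call, and only after all gradients for that optimizer's assigned parameters have been accumulated. Calling unscale_ twice for a given optimizer between each step triggers a RuntimeError.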

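Working with Scaled Gradients

Gradient accumulation

Gradient accumulation adds gradients over an effective batch of size batch_per_iter * iters_to_accumulate (* num_procs if distributed). The scale should be calibrated for the effective batch, which means inf/NaN checking, step skipping if inf/NaN grads are found, and scale updates should occur at effective-batch granularity. Also, grads should remain scaled, and the scale factor should remain constant, while grads for a given effective batch are accumulated. If grads are unscaled (or the scale factor changes) before accumulation is complete, the next backward pass will add scaled grads to unscaled grads (or grads scaled by a different factor), after which it is impossible to recover the accumulated unscaled grads that step must apply.

Therefore, if you want to unscale_ grads (e.g., to allow clipping unscaled grads), call unscale_ just before step, after all (scaled) grads for the upcoming step have been accumulated. Also, only call update at the end of iterations where you called step for a full effective batch: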

scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0:
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)

            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
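
The division by iters_to_accumulate in the loop above is what makes the accumulated gradient match the gradient of one large batch. The sketch below checks this on CPU in plain float32, without autocast or a scaler; the model, the data shapes, and the mean-reduced MSE loss are assumptions chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
iters_to_accumulate = 4
batch_per_iter = 8

model = nn.Linear(5, 1)
loss_fn = nn.MSELoss()  # mean reduction over the batch
inputs = torch.randn(iters_to_accumulate * batch_per_iter, 5)
targets = torch.randn(iters_to_accumulate * batch_per_iter, 1)

# Reference: one backward pass over the whole effective batch.
model.zero_grad()
loss_fn(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulation: several backward passes, each on loss / iters_to_accumulate.
model.zero_grad()
for x, y in zip(inputs.split(batch_per_iter), targets.split(batch_per_iter)):
    loss = loss_fn(model(x), y) / iters_to_accumulate
    loss.backward()  # .grad accumulates across backward calls
accumulated_grad = model.weight.grad.clone()

max_diff = (full_batch_grad - accumulated_grad).abs().max().item()
print(max_diff)
```

The two gradients agree up to floating-point rounding. This equality only holds while every micro-batch contribution carries the same scale factor, which is why the scale must stay constant and the gradients must stay scaled until the effective batch is complete.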

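Gradient penalty

A gradient penalty implementation commonly creates gradients using torch.autograd.grad(), combines them to create the penalty value, and adds the penalty value to the loss.

Here is an ordinary example of an L2 penalty without gradient scaling or autocasting: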

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)

        # Creates gradients
        grad_params = torch.autograd.grad(outputs=loss,
                                          inputs=model.parameters(),
                                          create_graph=True)

        # Computes the penalty term and adds it to the loss
        grad_norm = 0
        for grad in grad_params:
            grad_norm += grad.pow(2).sum()
        grad_norm = grad_norm.sqrt()
        loss = loss + grad_norm

        loss.backward()

        # clip gradients here, if desired

        optimizer.step()

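To implement a gradient penalty with gradient scaling, the outputs Tensor(s) passed to torch.autograd.grad() should be scaled. The resulting gradients will therefore be scaled, and should be unscaled before being combined to create the penalty value.

Also, the penalty term computation is part of the forward pass, and therefore should be inside an autocast context.

Here is how that looks for the same L2 penalty: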

scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)

        # Scales the loss for autograd.grad's backward pass, producing scaled_grad_params
        scaled_grad_params = torch.autograd.grad(outputs=scaler.scale(loss),
                                                 inputs=model.parameters(),
                                                 create_graph=True)

        # Creates unscaled grad_params before computing the penalty. scaled_grad_params are
        # not owned by any optimizer, so ordinary division is used instead of scaler.unscale_:
        inv_scale = 1./scaler.get_scale()
        grad_params = [p * inv_scale for p in scaled_grad_params]

        # Computes the penalty term and adds it to the loss
        with autocast(device_type='cuda', dtype=torch.float16):
            grad_norm = 0
            for grad in grad_params:
                grad_norm += grad.pow(2).sum()
            grad_norm = grad_norm.sqrt()
            loss = loss + grad_norm

        # Applies scaling to the backward call as usual.
        # Accumulates leaf gradients that are correctly scaled.
        scaler.scale(loss).backward()

        # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)

        # step() and update() proceed as usual.
        scaler.step(optimizer)
        scaler.update()
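
The manual unscaling step above works because gradients are linear in the loss: scaling the loss by a factor scales every gradient by the same factor. The following CPU sketch verifies this without a GradScaler; the constant `scale` stands in for `scaler.get_scale()`, and the model, data, and helper `l2_norm` are illustrative assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
x, y = torch.randn(6, 4), torch.randn(6, 1)
loss = nn.functional.mse_loss(model(x), y)
params = list(model.parameters())

# Reference: gradients of the unscaled loss.
ref_grads = torch.autograd.grad(loss, params, create_graph=True)

# Illustrative scale factor standing in for scaler.get_scale().
scale = 1024.0
scaled_grads = torch.autograd.grad(loss * scale, params, create_graph=True)

# Ordinary multiplication by the inverse scale, as in the sample above.
inv_scale = 1.0 / scale
grads = [g * inv_scale for g in scaled_grads]

def l2_norm(gs):
    # Global L2 norm over a list of gradient tensors.
    return torch.sqrt(sum(g.pow(2).sum() for g in gs))

penalty_ref = l2_norm(ref_grads)
penalty_unscaled = l2_norm(grads)
penalty_scaled = l2_norm(scaled_grads)
print(penalty_ref.item(), penalty_unscaled.item(), penalty_scaled.item())
```

The penalty computed from the manually unscaled gradients matches the reference, whereas the penalty computed from the still-scaled gradients is larger by the scale factor. Forgetting the unscaling step would therefore inflate the penalty term relative to the rest of the loss.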

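Working with Multiple Models, Losses, and Optimizers

If your network has multiple losses, you must call scaler.scale on each of them individually. If your network has multiple optimizers, you may call scaler.unscale_ on any of them individually, and you must call scaler.step on each of them individually.

However, scaler.update should only be called once, after all optimizers used this iteration have been stepped: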

scaler = torch.cuda.amp.GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer0.zero_grad()
        optimizer1.zero_grad()
        with autocast(device_type='cuda', dtype=torch.float16):
            output0 = model0(input)
            output1 = model1(input)
            loss0 = loss_fn(2 * output0 + 3 * output1, target)
            loss1 = loss_fn(3 * output0 - 5 * output1, target)

        # (retain_graph here is unrelated to amp, it's present because in this
        # example, both backward() calls share some sections of graph.)
        scaler.scale(loss0).backward(retain_graph=True)
        scaler.scale(loss1).backward()

        # You can choose which optimizers receive explicit unscaling, if you
        # want to inspect or modify the gradients of the params they own.
        scaler.unscale_(optimizer0)

        scaler.step(optimizer0)
        scaler.step(optimizer1)

        scaler.update()

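Each optimizer checks its gradients for infs/NaNs and makes an independent decision whether or not to skip the step. This may result in one optimizer skipping the step while the other one does not. Since step skipping occurs rarely (every several hundred iterations), this should not impede convergence. If you observe poor convergence after adding gradient scaling to a multiple-optimizer model, please report a bug.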
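
The interaction between per-optimizer step skipping and the single update call can be pictured with a toy model written in plain Python. `ToyScaler` is a hypothetical class invented for this sketch, not the library implementation; it only assumes the behavior described above plus GradScaler's documented default arguments (`init_scale=2.**16`, `growth_factor=2.0`, `backoff_factor=0.5`, `growth_interval=2000`).

```python
import math

class ToyScaler:
    """Toy model of per-optimizer step skipping and a single scale update."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.growth_tracker = 0
        self.found_inf = []  # one entry per optimizer stepped this iteration

    def step(self, grads):
        """Return True if the optimizer owning `grads` would take its step."""
        has_inf = any(not math.isfinite(g) for g in grads)
        self.found_inf.append(has_inf)
        return not has_inf

    def update(self):
        """Back off if any optimizer saw an inf/NaN, otherwise count toward growth."""
        if any(self.found_inf):
            self.scale *= self.backoff_factor
            self.growth_tracker = 0
        else:
            self.growth_tracker += 1
            if self.growth_tracker == self.growth_interval:
                self.scale *= self.growth_factor
                self.growth_tracker = 0
        self.found_inf = []

scaler = ToyScaler()
stepped0 = scaler.step([0.1, -0.3])          # optimizer0: finite gradients
stepped1 = scaler.step([0.2, float("inf")])  # optimizer1: overflowed gradient
scaler.update()                              # called once for both optimizers
print(stepped0, stepped1, scaler.scale)
```

In this toy run the first optimizer steps, the second skips, and the shared scale is halved once. Calling update after each optimizer instead would adjust the scale between the two steps, which is the situation the single call avoids.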

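Working with Multiple GPUs

The issues described here only affect autocast. GradScaler's usage is unchanged.

DataParallel in a single process

torch.nn.DataParallel spawns threads to run the forward pass on each device. The autocast state is propagated to each thread, so the following will work: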

model = MyModel()
dp_model = nn.DataParallel(model)

# Sets autocast in the main thread
with autocast(device_type='cuda', dtype=torch.float16):
    # dp_model's internal threads will autocast.
    output = dp_model(input)
    # loss_fn also autocast
    loss = loss_fn(output)

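DistributedDataParallel, one GPU per process

torch.nn.parallel.DistributedDataParallel's documentation recommends one GPU per process for best performance. In this case, DistributedDataParallel does not spawn threads internally, so usages of autocast and GradScaler are not affected.

DistributedDataParallel, multiple GPUs per process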

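Here torch.nn.parallel.DistributedDataParallel may spawn a side thread to run the forward pass on each device, like torch.nn.DataParallel. The fix is to apply autocast as part of your model's forward method, which ensures it is enabled in side threads.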
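
A minimal sketch of applying autocast inside forward is shown below. It uses a plain Python thread in place of the side threads a parallel wrapper might spawn, and falls back to CPU with bfloat16 when no GPU is available; the class `AutocastModel`, the `worker` function, and that fallback are assumptions for illustration.

```python
import threading

import torch
from torch import nn

# Assumption: use CUDA/float16 when available, otherwise CPU/bfloat16.
device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

class AutocastModel(nn.Module):  # hypothetical model
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, x):
        # Entering autocast here makes it active in whichever thread runs forward.
        with torch.autocast(device_type=x.device.type, dtype=amp_dtype):
            return self.linear(x)

model = AutocastModel().to(device)
x = torch.randn(3, 4, device=device)
results = {}

def worker():
    # This thread enters no autocast context of its own outside forward().
    results["dtype"] = model(x).dtype

thread = threading.Thread(target=worker)
thread.start()
thread.join()
print(results["dtype"])
```

Because the context manager lives inside forward, the output of the linear layer comes back in the reduced-precision dtype even though the calling thread never enabled autocast itself.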

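Autocast and Custom Autograd Functions

If your network uses custom autograd functions (subclasses of torch.autograd.Function), changes are required for autocast compatibility if any function

takes multiple floating-point Tensor inputs,
wraps any autocastable op (see the Autocast Op Reference), or
requires a particular dtype (for example, if it wraps CUDA extensions that were only compiled for dtype).

In all cases, if you are importing the function and cannot alter its definition, a safe fallback is to disable autocast and force execution in float32 (or dtype) at any points of use where errors occur: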

with autocast(device_type='cuda', dtype=torch.float16):
    ...
    with autocast(device_type='cuda', dtype=torch.float16, enabled=False):
        output = imported_function(input1.float(), input2.float())
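
The effect of the nested disabled region can be checked by inspecting dtypes. In this sketch `imported_function` is a hypothetical stand-in that simply wraps `torch.mm`, and the CPU/bfloat16 fallback is an assumption so the example also runs without a GPU.

```python
import torch

# Assumption: use CUDA/float16 when available, otherwise CPU/bfloat16.
device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

def imported_function(a, b):
    # Hypothetical stand-in for a function whose definition cannot be changed.
    return torch.mm(a, b)

a = torch.randn(4, 4, device=device)
b = torch.randn(4, 4, device=device)

with torch.autocast(device_type=device, dtype=amp_dtype):
    # An intermediate produced under autocast has the reduced-precision dtype.
    hidden = torch.mm(a, b)
    autocast_dtype = hidden.dtype

    with torch.autocast(device_type=device, enabled=False):
        # Inputs are cast to float32 explicitly, so the op runs in float32.
        output = imported_function(hidden.float(), b.float())
        fallback_dtype = output.dtype

print(autocast_dtype, fallback_dtype)
```

The explicit `.float()` calls matter: disabling autocast stops further casting but does not convert tensors that were already produced in reduced precision inside the outer region.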

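If you are the function's author (or can alter its definition), a better solution is to use the torch.cuda.amp.custom_fwd() and torch.cuda.amp.custom_bwd() decorators as shown in the relevant case below.

Functions with multiple inputs or autocastable ops

Apply custom_fwd and custom_bwd (with no arguments) to forward and backward respectively. These ensure forward executes with the current autocast state and backward executes with the same autocast state as forward (which can prevent type mismatch errors):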

class MyMM(torch.autograd.Function):
    @staticmethod
    @custom_fwd
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a.mm(b)
    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        a, b = ctx.saved_tensors
        return grad.mm(b.t()), a.t().mm(grad)

Now MyMM can be invoked anywhere, without disabling autocast or manually casting inputs:

mymm = MyMM.apply

with autocast(device_type='cuda', dtype=torch.float16):
    output = mymm(input1, input2)
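
As a sanity check on the hand-written backward, the sketch below repeats the MyMM definition and compares its gradients with those of the built-in `torch.mm`. It runs outside any autocast region so that it also works on CPU; the tensor shapes and variable names are illustrative, and newer PyTorch releases may emit a deprecation warning for the `torch.cuda.amp` decorators.

```python
import torch
from torch.cuda.amp import custom_bwd, custom_fwd

class MyMM(torch.autograd.Function):
    @staticmethod
    @custom_fwd
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a.mm(b)

    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        a, b = ctx.saved_tensors
        # d(a @ b)/da = grad @ b^T, d(a @ b)/db = a^T @ grad
        return grad.mm(b.t()), a.t().mm(grad)

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
a = torch.randn(3, 4, device=device, requires_grad=True)
b = torch.randn(4, 5, device=device, requires_grad=True)

# Gradients from the custom function.
MyMM.apply(a, b).sum().backward()
custom_grads = (a.grad.clone(), b.grad.clone())

# Reference gradients from the built-in op.
a.grad = None
b.grad = None
torch.mm(a, b).sum().backward()
reference_grads = (a.grad.clone(), b.grad.clone())
```

Matching gradients confirm that the decorators leave the function's math untouched; their only job is to carry the autocast state from forward into backward.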

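Functions that need a particular dtype

Consider a custom function that requires torch.float32 inputs. Apply custom_fwd(cast_inputs=torch.float32) to forward and custom_bwd (with no arguments) to backward. If forward runs in an autocast-enabled region, the decorators cast floating-point CUDA Tensor inputs to float32, and locally disable autocast during forward and backward: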

class MyFloat32Func(torch.autograd.Function):
    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)
    def forward(ctx, input):
        ctx.save_for_backward(input)
        ...
        return fwd_output
    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        ...

Now MyFloat32Func can be invoked anywhere, without manually disabling autocast or casting inputs:

func = MyFloat32Func.apply

with autocast(device_type='cuda', dtype=torch.float16):
    # func will run in float32, regardless of the surrounding autocast state
    output = func(input)