Quantization¶

Warning

Quantization is in beta and is subject to change.

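Introduction to Quantization¶

Quantization refers to techniques for performing computations and storing tensors at lower bitwidths than floating point precision. A quantized model executes some or all of its operations on tensors with integers rather than floating point values. This allows for a more compact model representation and the use of high performance vectorized operations on many hardware platforms. PyTorch supports INT8 quantization, which compared to typical FP32 models allows for a 4x reduction in model size and a 4x reduction in memory bandwidth requirements. Hardware support for INT8 computations is typically 2 to 4 times faster than FP32 compute. Quantization is primarily a technique to speed up inference, and only the forward pass is supported for quantized operators.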
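As a rough back-of-the-envelope illustration of the 4x figure, the sketch below compares the storage needed for the same number of parameters at 4 bytes (FP32) and 1 byte (INT8) each. The parameter count is an arbitrary example value, and the small overhead for storing scales and zero points is ignored.

```python
# Hypothetical parameter count, chosen only for illustration.
num_params = 25_000_000

bytes_fp32 = num_params * 4  # FP32 stores each value in 4 bytes
bytes_int8 = num_params * 1  # INT8 stores each value in 1 byte

print(f"FP32: {bytes_fp32 / 1e6:.0f} MB")
print(f"INT8: {bytes_int8 / 1e6:.0f} MB")
print(f"reduction: {bytes_fp32 / bytes_int8:.0f}x")
```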

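PyTorch supports multiple approaches to quantizing a deep learning model. In most cases the model is trained in FP32 and then converted to INT8. In addition, PyTorch also supports quantization aware training, which uses fake-quantization modules to model quantization error during training. Note that the entire computation is carried out in floating point. At the end of quantization aware training, PyTorch provides conversion functions to convert the trained model into lower precision.

At a lower level, PyTorch provides a way to represent quantized tensors and perform operations with them. They can be used to directly construct models that perform all or part of the computation in lower precision. Higher-level APIs are provided that incorporate the typical workflows for converting an FP32 model to lower precision with minimal accuracy loss.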

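Quantization requires users to be aware of three concepts:

Quantization config (Qconfig): specifies how weights and activations are to be quantized. A Qconfig is needed to create a quantized model.
Backend: refers to kernels that support quantization, usually with different numerics.
Quantization engine (torch.backends.quantized.engine): when a quantized model is executed, the qengine specifies which backend is to be used for execution. It is important to ensure that the qengine is consistent with the qconfig.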

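Natively supported backends¶

Currently PyTorch supports the following backends for running quantized operators efficiently:

x86 CPUs with AVX2 support or higher (without AVX2 some operations have inefficient implementations), via fbgemm (https://github.com/pytorch/FBGEMM).
ARM CPUs (typically found in mobile/embedded devices), via qnnpack (https://github.com/pytorch/QNNPACK).

The corresponding implementation is chosen automatically based on the PyTorch build mode, though users have the option to override this by setting torch.backends.quantized.engine to fbgemm or qnnpack.

Note

At the moment PyTorch does not provide quantized operator implementations on CUDA - this is a direction for future work. Move the model to CPU in order to test the quantized functionality.

Quantization aware training (through FakeQuantize, which emulates quantized numerics in fp32) supports both CPU and CUDA.

When preparing a quantized model, it is necessary to ensure that the qconfig and the engine used for quantized computations match the backend on which the model will be executed. The qconfig controls the type of observers used during the quantization passes. The qengine controls whether the fbgemm or the qnnpack specific packing function is used when packing weights for linear and convolution functions and modules. For example:

Default settings for fbgemm: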

# set the qconfig for PTQ
qconfig = torch.quantization.get_default_qconfig('fbgemm')
# or, set the qconfig for QAT
qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
# set the qengine to control weight packing
torch.backends.quantized.engine = 'fbgemm'

Default settings for qnnpack:

# set the qconfig for PTQ
qconfig = torch.quantization.get_default_qconfig('qnnpack')
# or, set the qconfig for QAT
qconfig = torch.quantization.get_default_qat_qconfig('qnnpack')
# set the qengine to control weight packing
torch.backends.quantized.engine = 'qnnpack'
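Not every PyTorch build ships both engines, so it can help to check what is available before picking one. The following is a minimal sketch, assuming that `torch.backends.quantized.supported_engines` lists the engines of the current build; `pick_engine` is a hypothetical helper written for this example and is not part of PyTorch.

```python
import torch

def pick_engine(preferred=("fbgemm", "qnnpack")):
    """Return the first preferred engine that this build supports, else None."""
    supported = torch.backends.quantized.supported_engines
    for name in preferred:
        if name in supported:
            return name
    return None

engine = pick_engine()
if engine is not None:
    # keep the qconfig and the qengine consistent with each other
    qconfig = torch.quantization.get_default_qconfig(engine)
    torch.backends.quantized.engine = engine
print("selected engine:", engine)
```

Deriving both the qconfig and the qengine from the same string is one simple way to keep the two from drifting apart.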

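Quantization API Summary¶

PyTorch provides two different modes of quantization: Eager Mode Quantization and FX Graph Mode Quantization.

Eager Mode Quantization is a beta feature. The user needs to do fusion and specify where quantization and dequantization happen manually, and it only supports modules, not functionals.

FX Graph Mode Quantization is a new automated quantization framework in PyTorch, and currently it is a prototype feature. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although users might need to refactor the model to make it compatible with FX Graph Mode Quantization (symbolically traceable with torch.fx). Note that FX Graph Mode Quantization is not expected to work on arbitrary models, since a model might not be symbolically traceable. We will integrate it into domain libraries like torchvision, and users will be able to quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we will provide general guidelines, but to actually make it work, users might need to be familiar with torch.fx, especially with how to make a model symbolically traceable.

New users of quantization are encouraged to try FX Graph Mode Quantization first. If it does not work, they can either follow the guidelines for using FX Graph Mode Quantization or fall back to Eager Mode Quantization.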

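The following table compares the differences between Eager Mode Quantization and FX Graph Mode Quantization:

	Eager Mode Quantization	FX Graph Mode Quantization
Release Status	beta	prototype
Operator Fusion	Manual	Automatic
Quant/DeQuant Placement	Manual	Automatic
Quantizing Modules	Supported	Supported
Quantizing Functionals/Torch Ops	Manual	Automatic
Support for Customization	Limited Support	Fully Supported
Quantization Mode Support	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static
Input/Output Model Type	`torch.nn.Module`	`torch.nn.Module` (may need some refactoring to make the model compatible with FX Graph Mode Quantization)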

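Three types of quantization are supported:

dynamic quantization (weights quantized, with activations read/stored in floating point and quantized for compute)
static quantization (weights quantized, activations quantized, calibration required post training)
static quantization aware training (weights quantized, activations quantized, quantization numerics modeled during training)

Please see our Introduction to Quantization on PyTorch blog post for a more comprehensive overview of the tradeoffs between these quantization types.

Operator coverage varies between dynamic and static quantization and is captured in the table below. Note that for FX quantization, the corresponding functionals are also supported.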

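	Static Quantization	Dynamic Quantization
nn.Linear nn.Conv1d/2d/3d	Y Y	Y N
nn.LSTM nn.GRU	N N	Y Y
nn.RNNCell nn.GRUCell nn.LSTMCell	N N N	Y Y Y
nn.EmbeddingBag	Y (activations are in fp32)	Y
nn.Embedding	Y	N
nn.MultiheadAttention	Not supported	Not supported
Activations	Broadly supported	Unchanged, computations stay in fp32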

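Eager Mode Quantization¶

Dynamic Quantization¶

This is the simplest form of quantization to apply: the weights are quantized ahead of time, but the activations are dynamically quantized during inference. It is used in situations where the model execution time is dominated by loading weights from memory rather than by computing the matrix multiplications. This is true for LSTM and Transformer type models with small batch sizes.

Diagram: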

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                 /
linear_weight_fp32

# dynamically quantized model
# linear and LSTM weights are in int8
previous_layer_fp32 -- linear_int8_w_fp32_inp -- activation_fp32 -- next_layer_fp32
                     /
   linear_weight_int8

API example:

import torch

# define a floating point model
class M(torch.nn.Module):
    def __init__(self):
        super(M, self).__init__()
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        x = self.fc(x)
        return x

# create a model instance
model_fp32 = M()
# create a quantized model instance
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,  # the original model
    {torch.nn.Linear},  # a set of layers to dynamically quantize
    dtype=torch.qint8)  # the target dtype for quantized weights

# run the model
input_fp32 = torch.randn(4, 4, 4, 4)
res = model_int8(input_fp32)

To learn more about dynamic quantization, please see our dynamic quantization tutorial.
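One way to see the effect of dynamic quantization on weight storage is to compare the serialized size of a model before and after conversion. The sketch below assumes a CPU build with a quantized engine available (fbgemm or qnnpack); `state_dict_size_bytes` is a hypothetical helper written for this example, and the layer size is arbitrary.

```python
import io
import torch

def state_dict_size_bytes(model):
    """Serialize the state_dict into memory and return its size in bytes."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes

# a single large linear layer, so that the weights dominate the size
model_fp32 = torch.nn.Sequential(torch.nn.Linear(256, 256))
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8)

size_fp32 = state_dict_size_bytes(model_fp32)
size_int8 = state_dict_size_bytes(model_int8)
print(f"fp32: {size_fp32} bytes, int8: {size_int8} bytes")
```

The ratio is usually somewhat below 4x, because the bias stays in floating point and the serialized file carries some fixed overhead.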

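Static Quantization¶

Static quantization quantizes the weights and activations of the model. It fuses activations into preceding layers where possible. It requires calibration with a representative dataset to determine optimal quantization parameters for activations. Post Training Quantization is typically used when both memory bandwidth and compute savings are important, with CNNs being a typical use case. Static quantization is also known as Post Training Quantization or PTQ.

Diagram: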

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                    /
    linear_weight_fp32

# statically quantized model
# weights and activations are in int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                    /
  linear_weight_int8

API example:

import torch

# define a floating point model where some layers could be statically quantized
class M(torch.nn.Module):
    def __init__(self):
        super(M, self).__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        # manually specify where tensors will be converted from floating
        # point to quantized in the quantized model
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        # manually specify where tensors will be converted from quantized
        # to floating point in the quantized model
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval mode for static quantization logic to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'fbgemm' for server inference and
# 'qnnpack' for mobile inference. Other quantization configurations such
# as selecting symmetric or asymmetric quantization and MinMax or L2Norm
# calibration techniques can be specified here.
model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')

# Fuse the activations to preceding layers, where applicable.
# This needs to be done manually depending on the model architecture.
# Common fusions include `conv + relu` and `conv + batchnorm + relu`
model_fp32_fused = torch.quantization.fuse_modules(model_fp32, [['conv', 'relu']])

# Prepare the model for static quantization. This inserts observers in
# the model that will observe activation tensors during calibration.
model_fp32_prepared = torch.quantization.prepare(model_fp32_fused)

# calibrate the prepared model to determine quantization parameters for activations
# in a real world setting, the calibration would be done with a representative dataset
input_fp32 = torch.randn(4, 1, 4, 4)
model_fp32_prepared(input_fp32)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, and replaces key operators with quantized
# implementations.
model_int8 = torch.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)

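To learn more about static quantization, please see the static quantization tutorial.

Quantization Aware Training¶

Quantization Aware Training models the effects of quantization during training, allowing for higher accuracy compared to other quantization methods. During training, all calculations are done in floating point, with fake_quant modules modeling the effects of quantization by clamping and rounding to simulate the effects of INT8. After model conversion, weights and activations are quantized, and activations are fused into the preceding layer where possible. It is commonly used with CNNs and yields a higher accuracy compared to static quantization. Quantization Aware Training is also known as QAT.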
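The clamping and rounding performed by a fake-quantization step can be illustrated with `torch.fake_quantize_per_tensor_affine`. This is only a sketch of the numerics: the scale, zero point and input values are arbitrary example choices, and the FakeQuantize modules used in real QAT additionally track observers for these parameters.

```python
import torch

x = torch.tensor([-1.0, 0.04, 0.26, 100.0])

# quantize and immediately dequantize, using int8-like bounds
x_fq = torch.fake_quantize_per_tensor_affine(
    x, scale=0.1, zero_point=0, quant_min=-128, quant_max=127)

print(x_fq.dtype)  # the result is still a floating point tensor
print(x_fq)        # values are snapped to multiples of the scale and clamped
```

With these parameters, 0.04 rounds down to 0.0, 0.26 rounds up to 0.3, and 100.0 is clamped to the largest representable value, 127 * 0.1 = 12.7.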

Diagram:

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                      /
    linear_weight_fp32

# model with fake_quants for modeling quantization numerics during training
previous_layer_fp32 -- fq -- linear_fp32 -- activation_fp32 -- fq -- next_layer_fp32
                           /
   linear_weight_fp32 -- fq

# quantized model
# weights and activations are in int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                     /
   linear_weight_int8

API example:

import torch

# define a floating point model where some layers could benefit from QAT
class M(torch.nn.Module):
    def __init__(self):
        super(M, self).__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.bn = torch.nn.BatchNorm2d(1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to train mode for QAT logic to work
model_fp32.train()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'fbgemm' for server inference and
# 'qnnpack' for mobile inference. Other quantization configurations such
# as selecting symmetric or asymmetric quantization and MinMax or L2Norm
# calibration techniques can be specified here.
model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')

# fuse the activations to preceding layers, where applicable
# this needs to be done manually depending on the model architecture
model_fp32_fused = torch.quantization.fuse_modules(model_fp32,
    [['conv', 'bn', 'relu']])

# Prepare the model for QAT. This inserts observers and fake_quants in
# the model that will observe weight and activation tensors during calibration.
model_fp32_prepared = torch.quantization.prepare_qat(model_fp32_fused)

# run the training loop (not shown); `training_loop` stands in for your own code
training_loop(model_fp32_prepared)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, fuses modules where appropriate,
# and replaces key operators with quantized implementations.
model_fp32_prepared.eval()
model_int8 = torch.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
input_fp32 = torch.randn(4, 1, 4, 4)
res = model_int8(input_fp32)

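To learn more about quantization aware training, please see the QAT tutorial.

(Prototype) FX Graph Mode Quantization¶

Quantization types supported by FX Graph Mode can be classified in two ways:

Post Training Quantization (apply quantization after training; quantization parameters are calculated based on sample calibration data)
Quantization Aware Training (simulate quantization during training so that the quantization parameters can be learned together with the model using training data)

Each of these two may then include any or all of the following types:

Weight Only Quantization (only the weights are statically quantized)
Dynamic Quantization (weights are statically quantized, activations are dynamically quantized)
Static Quantization (both weights and activations are statically quantized)

These two ways of classification are independent, so theoretically we can have 6 different types of quantization.

The supported quantization types in FX Graph Mode Quantization are:

Post Training Quantization
- Weight Only Quantization
- Dynamic Quantization
- Static Quantization
Quantization Aware Training
- Static Quantization

There are multiple quantization types in post training quantization (weight only, dynamic and static), and the configuration is done through qconfig_dict (an argument of the prepare_fx function).

API example: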

import copy
import torch
import torch.quantization.quantize_fx as quantize_fx

model_fp = UserModel(...)

#
# post training dynamic/weight_only quantization
#

# we need to deepcopy if we still want to keep model_fp unchanged after quantization since quantization apis change the input model
model_to_quantize = copy.deepcopy(model_fp)
model_to_quantize.eval()
qconfig_dict = {"": torch.quantization.default_dynamic_qconfig}
# prepare
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_dict)
# no calibration needed when we only have dynamic/weight_only quantization
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# post training static quantization
#

model_to_quantize = copy.deepcopy(model_fp)
qconfig_dict = {"": torch.quantization.get_default_qconfig('qnnpack')}
model_to_quantize.eval()
# prepare
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_dict)
# calibrate (not shown)
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# quantization aware training for static quantization
#

model_to_quantize = copy.deepcopy(model_fp)
qconfig_dict = {"": torch.quantization.get_default_qat_qconfig('qnnpack')}
model_to_quantize.train()
# prepare
model_prepared = quantize_fx.prepare_qat_fx(model_to_quantize, qconfig_dict)
# training loop (not shown)
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# fusion
#
model_to_quantize = copy.deepcopy(model_fp)
model_fused = quantize_fx.fuse_fx(model_to_quantize)

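For more information about FX Graph Mode Quantization, please see the FX Graph Mode Quantization tutorials.

Quantized Tensors¶

PyTorch supports both per tensor and per channel asymmetric linear quantization. Per tensor means that all the values within the tensor are scaled the same way. Per channel means that for each dimension, typically the channel dimension of a tensor, the values in the tensor are scaled and offset by a different value (effectively the scale and offset become vectors). This allows for less error when converting tensors to quantized values.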
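A small sketch of per channel quantization using `torch.quantize_per_channel`, with one scale and zero point per row. The tensor values, scales and zero points are arbitrary example choices, picked so that two rows with very different ranges end up using the int8 range equally well.

```python
import torch

# two output channels with very different value ranges
w = torch.tensor([[-1.0, 0.5, 1.0],
                  [-10.0, 5.0, 10.0]])

# one (scale, zero_point) pair per channel along axis 0
scales = torch.tensor([0.01, 0.1])
zero_points = torch.tensor([0, 0])
wq = torch.quantize_per_channel(w, scales, zero_points, axis=0, dtype=torch.qint8)

print(wq.int_repr())              # integer codes stored for each channel
print(wq.q_per_channel_scales())  # the scale is now a vector
```

With a single per tensor scale of 0.1, the first row would collapse to the codes -10, 5 and 10, losing most of its resolution.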

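The mapping is performed by converting the floating point tensors using

$Q(x, \text{scale}, \text{zero\_point}) = \text{round}\left(\frac{x}{\text{scale}} + \text{zero\_point}\right)$

Note that we ensure that zero in floating point is represented with no error after quantization, thereby ensuring that operations like padding do not cause additional quantization error.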
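The mapping and the zero-preservation property can be sketched in plain Python, assuming the usual affine form q = round(x / scale + zero_point). The clamping to the range [qmin, qmax] and the helper names `quantize` and `dequantize` are assumptions made for this illustration.

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a float to an integer code, clamped to [qmin, qmax]."""
    q = round(x / scale + zero_point)
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate float."""
    return (q - zero_point) * scale

scale, zero_point = 0.1, 10

# 0.0 maps exactly onto the zero_point code and back to 0.0
q_zero = quantize(0.0, scale, zero_point)
print(q_zero, dequantize(q_zero, scale, zero_point))

# values outside the representable range saturate at qmax
print(quantize(100.0, scale, zero_point))
```

Because zero_point is itself an integer code, the float value 0.0 always round-trips without error, which is why zero padding adds no quantization noise.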

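In order to do quantization in PyTorch, we need to be able to represent quantized data in Tensors. A quantized Tensor allows for storing quantized data (represented as int8/uint8/int32) along with quantization parameters like scale and zero_point. Quantized Tensors allow for many useful operations that make quantized arithmetic easy, in addition to allowing for serialization of data in a quantized format.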
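A minimal sketch of creating and inspecting a quantized Tensor with `torch.quantize_per_tensor`. The scale and zero point are arbitrary example values; in practice they usually come from an observer.

```python
import torch

x = torch.tensor([-1.0, 0.0, 0.5, 1.0])
xq = torch.quantize_per_tensor(x, scale=0.1, zero_point=10, dtype=torch.quint8)

print(xq.int_repr())                    # the stored uint8 codes
print(xq.q_scale(), xq.q_zero_point())  # the quantization parameters
print(xq.dequantize())                  # back to a float32 tensor
```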

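Operation coverage¶

Quantized Tensors support a limited subset of the data manipulation methods of regular full-precision tensors. For NN operators included in PyTorch, we restrict support to:

8 bit weights (data_type = qint8)

8 bit activations (data_type = quint8)

Note that operator implementations currently only support per channel quantization for the weights of the conv and linear operators. Furthermore, the minimum and the maximum of the input data are mapped linearly to the minimum and the maximum of the quantized data type, such that zero is represented with no quantization error.

Additional data types and quantization schemes can be implemented through the custom operator mechanism.

Many operations for quantized tensors are available under the same API as the full float version in torch or torch.nn. Quantized versions of NN modules that perform re-quantization are available in torch.nn.quantized. Those operations explicitly take output quantization parameters (scale and zero_point) in the operation signature.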
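To illustrate what it means for an operation to take the output quantization parameters in its signature, the sketch below adds two quantized tensors with `torch.ops.quantized.add`. It assumes a CPU build with quantized kernels available; the input values and the output scale and zero point are arbitrary example choices.

```python
import torch

a = torch.quantize_per_tensor(torch.tensor([1.0, 2.0]), 0.1, 0, torch.quint8)
b = torch.quantize_per_tensor(torch.tensor([1.0, 2.0]), 0.1, 0, torch.quint8)

# the caller chooses the scale and zero_point of the result
out = torch.ops.quantized.add(a, b, 0.2, 0)

print(out.q_scale(), out.q_zero_point())
print(out.dequantize())
```

The sum has a wider range than either input, so the output is given a larger scale than the inputs; this re-quantization step is why the output parameters must be supplied explicitly.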

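In addition, we also support fused versions corresponding to common fusion patterns that impact quantization at torch.nn.intrinsic.quantized.

For quantization aware training, we support modules prepared for quantization aware training at torch.nn.qat and torch.nn.intrinsic.qat.

The list of supported operations is sufficient to cover typical CNN and RNN models.

Quantization Customization¶

While default implementations of observers that select the scale factor and bias based on observed tensor data are provided, developers can provide their own quantization functions. Quantization can be applied selectively to different parts of the model or configured differently for different parts of the model.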
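The sketch below shows how one of the provided observers derives quantization parameters from observed data, and how observers can be combined into a custom QConfig. The input values and the particular observer choices are assumptions made for illustration, not recommended settings.

```python
import torch

# an observer records statistics of the tensors passed through it
observer = torch.quantization.MinMaxObserver(dtype=torch.quint8)
observer(torch.tensor([-1.0, 0.0, 3.0]))

# the observed range [-1, 3] is mapped onto the 256 quint8 levels
scale, zero_point = observer.calculate_qparams()
print(scale, zero_point)

# a custom qconfig pairs an activation observer with a weight observer
custom_qconfig = torch.quantization.QConfig(
    activation=torch.quantization.MinMaxObserver.with_args(dtype=torch.quint8),
    weight=torch.quantization.MinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_tensor_symmetric))
```

Here the scale works out to (3 - (-1)) / 255, and the zero point is the integer code that represents 0.0 exactly.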

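We also provide support for per channel quantization for conv2d(), conv3d() and linear().

Quantization workflows work by adding (e.g. adding observers as .observer submodules) or replacing (e.g. converting nn.Conv2d to nn.quantized.Conv2d) submodules in the model's module hierarchy. This means that the model stays a regular nn.Module-based instance throughout the process and thus can work with the rest of the PyTorch APIs.

Model Preparation for Quantization¶

It is currently necessary to make some modifications to the model definition prior to quantization. This is because quantization currently works on a module by module basis. Specifically, for all quantization techniques, the user needs to:

Convert any operations that require output requantization (and thus have additional parameters) from functionals to module form (for example, using torch.nn.ReLU instead of torch.nn.functional.relu).
Specify which parts of the model need to be quantized, either by assigning .qconfig attributes on submodules or by specifying qconfig_dict. For example, setting model.conv1.qconfig = None means that the model.conv1 layer will not be quantized, and setting model.linear1.qconfig = custom_qconfig means that the quantization settings for model.linear1 will use custom_qconfig instead of the global qconfig.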
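The effect of per-submodule `.qconfig` attributes can be checked after `prepare()`: modules that will be quantized receive an observer, while modules with `qconfig = None` are left untouched. This is a minimal sketch; the `TwoLayer` model is hypothetical, and inspecting the `activation_post_process` attribute relies on how Eager Mode attaches observers, which is an implementation detail.

```python
import torch

class TwoLayer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(4, 4)
        self.fc2 = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.fc2(self.fc1(x))

model = TwoLayer().eval()
# global qconfig for the whole model ...
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
# ... except fc2, which is excluded from quantization
model.fc2.qconfig = None

prepared = torch.quantization.prepare(model)
print(hasattr(prepared.fc1, 'activation_post_process'))
print(hasattr(prepared.fc2, 'activation_post_process'))
```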

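For static quantization techniques which quantize activations, the user needs to do the following in addition:

Specify where activations are quantized and de-quantized. This is done using the QuantStub and DeQuantStub modules.
Use torch.nn.quantized.FloatFunctional to wrap tensor operations that require special handling for quantization into modules. Examples are operations like add and cat, which require special handling to determine the output quantization parameters.
Fuse modules: combine operations/modules into a single module to obtain higher accuracy and performance. This is done using the torch.quantization.fuse_modules() API, which takes in lists of modules to be fused. We currently support the following fusions: [Conv, Relu], [Conv, BatchNorm], [Conv, BatchNorm, Relu], [Linear, Relu]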
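The three steps above can be sketched on a small residual block. `ResidualBlock` is a hypothetical model written for this example; it is only run in floating point here, to show that the prepared structure behaves like a regular module before any conversion.

```python
import torch

class ResidualBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()  # module form instead of functional relu
        self.skip_add = torch.nn.quantized.FloatFunctional()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        out = self.relu(self.conv(x))
        out = self.skip_add.add(out, x)  # used instead of `out + x`
        return self.dequant(out)

model = ResidualBlock().eval()

# fuse conv + relu into one module; the old relu slot becomes a no-op
fused = torch.quantization.fuse_modules(model, [['conv', 'relu']])
print(type(fused.conv).__name__, type(fused.relu).__name__)

y = fused(torch.randn(2, 1, 4, 4))
```

Wrapping the addition in a FloatFunctional gives the quantization workflow a module on which it can attach an observer and later store the output scale and zero point.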

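Best Practices¶

Set the reduce_range argument on observers to True if you are using the fbgemm backend. This argument prevents overflow on some int8 instructions by reducing the range of the quantized data type by 1 bit.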
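The effect of reduce_range can be seen by comparing the quantization parameters that two observers compute from the same data. This is a small sketch with arbitrary input values; it only illustrates the narrower range and does not reproduce the overflow that the argument guards against.

```python
import torch

x = torch.tensor([-1.0, 0.0, 3.0])

full = torch.quantization.MinMaxObserver(dtype=torch.quint8)
reduced = torch.quantization.MinMaxObserver(dtype=torch.quint8, reduce_range=True)
full(x)
reduced(x)

scale_full, zp_full = full.calculate_qparams()
scale_reduced, zp_reduced = reduced.calculate_qparams()
print(scale_full, zp_full)        # range spread over 256 levels
print(scale_reduced, zp_reduced)  # same range spread over 128 levels
```

With one bit less, the same observed range is covered by half as many levels, so the scale roughly doubles.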

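Common Errors¶

Passing a non-quantized Tensor into a quantized kernel¶

If you see an error similar to:

RuntimeError: Could not run 'quantized::some_operator' with arguments from the 'CPU' backend...

This means that you are trying to pass a non-quantized Tensor to a quantized kernel. A common workaround is to use torch.quantization.QuantStub to quantize the tensor. This needs to be done manually in Eager Mode Quantization. An e2e example: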

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)

    def forward(self, x):
        # during the convert step, this will be replaced with a
        # `quantize_per_tensor` call
        x = self.quant(x)
        x = self.conv(x)
        return x

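Passing a quantized Tensor into a non-quantized kernel¶

If you see an error similar to:

RuntimeError: Could not run 'aten::thnn_conv2d_forward' with arguments from the 'QuantizedCPU' backend.

This means that you are trying to pass a quantized Tensor to a non-quantized kernel. A common workaround is to use torch.quantization.DeQuantStub to dequantize the tensor. This needs to be done manually in Eager Mode Quantization. An e2e example: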

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv1 = torch.nn.Conv2d(1, 1, 1)
        # this module will not be quantized (see `qconfig = None` logic below)
        self.conv2 = torch.nn.Conv2d(1, 1, 1)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        # during the convert step, this will be replaced with a
        # `quantize_per_tensor` call
        x = self.quant(x)
        x = self.conv1(x)
        # during the convert step, this will be replaced with a
        # `dequantize` call
        x = self.dequant(x)
        x = self.conv2(x)
        return x

m = M()
m.qconfig = some_qconfig
# turn off quantization for conv2
m.conv2.qconfig = None

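Modules that provide quantization functions and classes¶

torch.quantization	This module implements the functions you call directly to convert your model from FP32 to quantized form. For example, `prepare()` is used in post training quantization to prepare your model for the calibration step, and `convert()` actually converts the weights to int8 and replaces the operations with their quantized counterparts. There are other helper functions for things like quantizing the input to your model and performing critical fusions like conv+relu.
torch.nn.intrinsic	This module implements the combined (fused) modules such as conv + relu, which can then be quantized.
torch.nn.intrinsic.qat	This module implements the versions of those fused operations needed for quantization aware training.
torch.nn.intrinsic.quantized	This module implements the quantized implementations of fused operations like conv + relu.
torch.nn.qat	This module implements versions of the key nn modules Conv2d() and Linear() which run in FP32 but with rounding applied to simulate the effect of INT8 quantization.
torch.nn.quantized	This module implements the quantized versions of nn layers such as torch.nn.Conv2d and torch.nn.ReLU.
torch.nn.quantized.dynamic	Dynamically quantized `Linear`, `LSTM`, `LSTMCell`, `GRUCell`, and `RNNCell`.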
