Quantization¶

Warning

Quantization is in beta and subject to change.

Introduction to Quantization¶

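Quantization refers to techniques for performing computations and storing tensors at lower bitwidths than floating point precision. A quantized model executes some or all of the operations on tensors with reduced precision rather than full precision (floating point) values. This allows for a more compact model representation and the use of high performance vectorized operations on many hardware platforms. PyTorch supports INT8 quantization; compared to typical FP32 models, this allows for a 4x reduction in model size and a 4x reduction in memory bandwidth requirements. Hardware support for INT8 computations is typically 2 to 4 times faster than FP32 compute. Quantization is primarily a technique to speed up inference, and only the forward pass is supported for quantized operators.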
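The 4x figure follows directly from the storage width of each element. As a rough sketch, assuming a hypothetical model with 25 million parameters and ignoring the small overhead of storing scales and zero points:

```python
# Bytes needed to store one element in each format.
BYTES_FP32 = 4
BYTES_INT8 = 1

def model_size_bytes(num_params: int, bytes_per_param: int) -> int:
    """Storage needed for the weights alone (scales/zero points are ignored)."""
    return num_params * bytes_per_param

# Hypothetical parameter count, chosen only for illustration.
num_params = 25_000_000
fp32_size = model_size_bytes(num_params, BYTES_FP32)
int8_size = model_size_bytes(num_params, BYTES_INT8)
reduction = fp32_size / int8_size

print(f"FP32: {fp32_size / 1e6:.0f} MB, INT8: {int8_size / 1e6:.0f} MB, "
      f"{reduction:.0f}x smaller")
```

The same ratio applies to the memory traffic needed to load the weights, which is where the bandwidth saving comes from.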

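PyTorch supports multiple approaches to quantizing a deep learning model. In most cases the model is trained in FP32 and then converted to INT8. In addition, PyTorch also supports quantization aware training, which models quantization errors using fake-quantization modules. Note that the entire computation is carried out in floating point. At the end of quantization aware training, PyTorch provides conversion functions to convert the trained model into lower precision.

At a lower level, PyTorch provides a way to represent quantized tensors and perform operations with them. They can be used to directly construct models that perform all or part of the computation in lower precision. Higher-level APIs are provided that incorporate typical workflows for converting an FP32 model to lower precision with minimal accuracy loss.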

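Quantization API Summary¶

PyTorch provides three different modes of quantization: Eager Mode Quantization, FX Graph Mode Quantization (maintenance) and PyTorch 2 Export Quantization.

Eager Mode Quantization is a beta feature. Users need to do fusion and specify where quantization and dequantization happen manually. It also only supports modules, not functionals.

FX Graph Mode Quantization is an automated quantization workflow in PyTorch. It is currently a prototype feature, and it is in maintenance mode since we have PyTorch 2 Export Quantization. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although people might need to refactor the model to make it compatible with FX Graph Mode Quantization (symbolically traceable with torch.fx). Note that FX Graph Mode Quantization is not expected to work on arbitrary models, since a model might not be symbolically traceable. We will integrate it into domain libraries like torchvision, and users will be able to quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we will provide general guidelines, but to actually make it work, users might need to be familiar with torch.fx, especially on how to make a model symbolically traceable.

PyTorch 2 Export Quantization is the new full graph mode quantization workflow, released as a prototype feature in PyTorch 2.1. With PyTorch 2, we are moving to a better solution for full program capture (torch.export), since it can capture a higher percentage of models (88.8% on 14K models) compared to torch.fx.symbolic_trace (72.7% on 14K models), the program capture solution used by FX Graph Mode Quantization. torch.export still has limitations around some Python constructs and requires user involvement to support dynamism in the exported model, but overall it is an improvement over the previous program capture solution. PyTorch 2 Export Quantization is built for models captured by torch.export, with the flexibility and productivity of both modeling users and backend developers in mind. The main features are: (1) a programmable API for configuring how a model is quantized that can scale to many more use cases; (2) a simplified UX for modeling users and backend developers, since they only need to interact with a single object (Quantizer) to express the user's intention about how to quantize a model and what the backend supports; (3) an optional reference quantized model representation that can represent quantized computation with integer operations that map closer to the actual quantized computations that happen in hardware.

New users of quantization are encouraged to try out PyTorch 2 Export Quantization first. If it does not work well, users can try Eager Mode Quantization.

The following table compares the differences between Eager Mode Quantization, FX Graph Mode Quantization and PyTorch 2 Export Quantization: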

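	Eager Mode Quantization	FX Graph Mode Quantization	PyTorch 2 Export Quantization
Release Status	beta	prototype (maintenance)	prototype
Operator Fusion	Manual	Automatic	Automatic
Quant/DeQuant Placement	Manual	Automatic	Automatic
Quantizing Modules	Supported	Supported	Supported
Quantizing Functionals/Torch Ops	Manual	Automatic	Supported
Support for Customization	Limited Support	Fully Supported	Fully Supported
Quantization Mode Support	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static	Defined by Backend Specific Quantizer
Input/Output Model Type	`torch.nn.Module`	`torch.nn.Module` (may need some refactoring to make the model compatible with FX Graph Mode Quantization)	`torch.fx.GraphModule` (captured by `torch.export`)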

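There are three types of quantization supported:

dynamic quantization (weights quantized, with activations read/stored in floating point and quantized for compute)
static quantization (weights quantized, activations quantized, calibration required post training)
static quantization aware training (weights quantized, activations quantized, quantization numerics modeled during training)

Please see our Introduction to Quantization on PyTorch blog post for a more comprehensive overview of the tradeoffs between these quantization types.

Operator coverage varies between dynamic and static quantization and is captured in the table below.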

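	Static Quantization	Dynamic Quantization
nn.Linear	Y	Y
nn.Conv1d/2d/3d	Y	N
nn.LSTM	Y (through custom modules)	Y
nn.GRU	N	Y
nn.RNNCell	N	Y
nn.GRUCell	N	Y
nn.LSTMCell	N	Y
nn.EmbeddingBag	Y (activations are in FP32)	Y
nn.Embedding	Y	Y
nn.MultiheadAttention	Y (through custom modules)	Not supported
Activations	Broadly supported	Unchanged, computations stay in FP32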

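Eager Mode Quantization¶

For a general introduction to the quantization flow, including different types of quantization, please take a look at General Quantization Flow.

Post Training Dynamic Quantization¶

This is the simplest form of quantization to apply, where the weights are quantized ahead of time but the activations are dynamically quantized during inference. This is used for situations where the model execution time is dominated by loading weights from memory rather than computing the matrix multiplications. This is true for LSTM and Transformer type models with small batch sizes.

Diagram: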

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                 /
linear_weight_fp32

# dynamically quantized model
# linear and LSTM weights are in int8
previous_layer_fp32 -- linear_int8_w_fp32_inp -- activation_fp32 -- next_layer_fp32
                     /
   linear_weight_int8

PTDQ API Example:

import torch

# define a floating point model
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        x = self.fc(x)
        return x

# create a model instance
model_fp32 = M()
# create a quantized model instance
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,  # the original model
    {torch.nn.Linear},  # a set of layers to dynamically quantize
    dtype=torch.qint8)  # the target dtype for quantized weights

# run the model
input_fp32 = torch.randn(4, 4, 4, 4)
res = model_int8(input_fp32)

To learn more about dynamic quantization please see our dynamic quantization tutorial.
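To make the "dynamic" part concrete, the following pure-Python sketch quantizes a weight vector once, ahead of time, and derives the activation scale and zero point from each input as it arrives. The class `DynamicLinearSketch` and the helper names are hypothetical and only illustrate the idea; they are not how `quantize_dynamic` is implemented.

```python
def affine_params(lo, hi, qmin, qmax):
    """Scale and zero point mapping the float range [lo, hi] onto [qmin, qmax]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)       # the range must contain 0.0
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for constant input
    zero_point = round(qmin - lo / scale)
    return scale, max(qmin, min(qmax, zero_point))

def quantize(values, scale, zero_point, qmin, qmax):
    """Affine quantization of a list of floats, clamped to [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / scale + zero_point))) for v in values]

class DynamicLinearSketch:
    """Toy dot product with int8 weights and dynamically quantized inputs."""

    def __init__(self, weights):
        # Weights are quantized once, ahead of time (signed 8-bit range).
        self.w_scale, self.w_zp = affine_params(min(weights), max(weights), -128, 127)
        self.w_q = quantize(weights, self.w_scale, self.w_zp, -128, 127)

    def __call__(self, x):
        # Activation parameters are computed from the data seen in this call.
        x_scale, x_zp = affine_params(min(x), max(x), 0, 255)
        x_q = quantize(x, x_scale, x_zp, 0, 255)
        # Integer accumulation, then rescale back to floating point.
        acc = sum((xq - x_zp) * (wq - self.w_zp) for xq, wq in zip(x_q, self.w_q))
        return acc * x_scale * self.w_scale

weights = [0.5, -0.25, 1.0]
layer = DynamicLinearSketch(weights)
x = [1.0, 2.0, -1.0]
exact = sum(a * b for a, b in zip(x, weights))
approx = layer(x)
print(exact, approx)
```

Because the activation range is measured per call, no calibration dataset is needed, at the cost of computing the range at inference time.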

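Post Training Static Quantization¶

Post Training Static Quantization (PTQ static) quantizes the weights and activations of the model. It fuses activations into preceding layers where possible. It requires calibration with a representative dataset to determine optimal quantization parameters for activations. Post Training Static Quantization is typically used when both memory bandwidth and compute savings are important, with CNNs being a typical use case.

We may need to modify the model before applying post training static quantization. Please see Model Preparation for Eager Mode Static Quantization.

Diagram: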

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                    /
    linear_weight_fp32

# statically quantized model
# weights and activations are in int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                    /
  linear_weight_int8

PTSQ API Example:

import torch

# define a floating point model where some layers could be statically quantized
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        # manually specify where tensors will be converted from floating
        # point to quantized in the quantized model
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        # manually specify where tensors will be converted from quantized
        # to floating point in the quantized model
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval mode for static quantization logic to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('x86')

# Fuse the activations to preceding layers, where applicable.
# This needs to be done manually depending on the model architecture.
# Common fusions include `conv + relu` and `conv + batchnorm + relu`
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])

# Prepare the model for static quantization. This inserts observers in
# the model that will observe activation tensors during calibration.
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)

# calibrate the prepared model to determine quantization parameters for activations
# in a real world setting, the calibration would be done with a representative dataset
input_fp32 = torch.randn(4, 1, 4, 4)
model_fp32_prepared(input_fp32)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and zero_point values to be
# used with each activation tensor, and replaces key operators with quantized
# implementations.
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)

To learn more about static quantization, please see the static quantization tutorial.
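The code comments above list `conv + batchnorm + relu` as a common fusion. One way to see why a convolution followed by BatchNorm can collapse into a single module at inference time is the standard folding identity, shown here for a single scalar weight. This is an illustrative sketch with made-up numbers, not the code that `fuse_modules` runs.

```python
import math

def conv_then_bn(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Scalar 'convolution' followed by BatchNorm in inference mode."""
    return gamma * ((w * x + b) - mean) / math.sqrt(var + eps) + beta

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the BatchNorm statistics into the preceding weight and bias."""
    factor = gamma / math.sqrt(var + eps)
    return w * factor, (b - mean) * factor + beta

# Hypothetical parameters, chosen only for illustration.
w, b = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.3, 0.04

w_folded, b_folded = fold_batchnorm(w, b, gamma, beta, mean, var)

inputs = [-1.0, 0.0, 0.5, 2.0]
unfused = [max(0.0, conv_then_bn(x, w, b, gamma, beta, mean, var)) for x in inputs]
fused = [max(0.0, w_folded * x + b_folded) for x in inputs]  # conv + relu only
max_diff = max(abs(u - f) for u, f in zip(unfused, fused))
print(max_diff)
```

After folding, only one set of weights and one output need to be quantized, which is why fusion helps both accuracy and speed.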

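Quantization Aware Training for Static Quantization¶

Quantization Aware Training (QAT) models the effects of quantization during training, allowing for higher accuracy compared to other quantization methods. We can do QAT for static, dynamic or weight only quantization. During training, all calculations are done in floating point, with fake_quant modules modeling the effects of quantization by clamping and rounding to simulate the effects of INT8. After model conversion, weights and activations are quantized, and activations are fused into the preceding layer where possible. It is commonly used with CNNs and yields a higher accuracy compared to static quantization.

We may need to modify the model before applying quantization aware training. Please see Model Preparation for Eager Mode Static Quantization.

Diagram: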

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                      /
    linear_weight_fp32

# model with fake_quants for modeling quantization numerics during training
previous_layer_fp32 -- fq -- linear_fp32 -- activation_fp32 -- fq -- next_layer_fp32
                           /
   linear_weight_fp32 -- fq

# quantized model
# weights and activations are in int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                     /
   linear_weight_int8

QAT API Example:

import torch

# define a floating point model where some layers could benefit from QAT
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.bn = torch.nn.BatchNorm2d(1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval for fusion to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')

# fuse the activations to preceding layers, where applicable
# this needs to be done manually depending on the model architecture
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32,
    [['conv', 'bn', 'relu']])

# Prepare the model for QAT. This inserts observers and fake_quants in
# the model that will observe weight and activation tensors during calibration.
# The model needs to be set to train mode for QAT logic to work.
model_fp32_prepared = torch.ao.quantization.prepare_qat(model_fp32_fused.train())

# run the training loop (not shown)
training_loop(model_fp32_prepared)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and zero_point values to be
# used with each activation tensor, fuses modules where appropriate,
# and replaces key operators with quantized implementations.
model_fp32_prepared.eval()
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
input_fp32 = torch.randn(4, 1, 4, 4)
res = model_int8(input_fp32)

To learn more about quantization aware training, please see the QAT tutorial.
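The "clamping and rounding" performed by a fake_quant module can be sketched in a few lines of plain Python: the value is quantized and immediately dequantized, so it stays a float but carries the rounding and saturation error that INT8 would introduce. The function name and the parameter values below are assumptions made for illustration.

```python
def fake_quantize(x, scale, zero_point, quant_min=0, quant_max=255):
    """Quantize then immediately dequantize; the result is still a float."""
    q = round(x / scale + zero_point)
    q = max(quant_min, min(quant_max, q))   # clamp to the integer range
    return (q - zero_point) * scale

# Hypothetical quantization parameters.
scale, zero_point = 0.1, 128

inside = fake_quantize(0.26, scale, zero_point)    # rounded to the nearest step
outside = fake_quantize(100.0, scale, zero_point)  # saturates at quant_max
print(inside, outside)
```

During QAT the rest of the network trains against these perturbed values, which is what lets it adapt to quantization before conversion.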

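Model Preparation for Eager Mode Static Quantization¶

It is currently necessary to make some modifications to the model definition prior to Eager mode quantization. This is because quantization currently works on a module by module basis. Specifically, for all quantization techniques, the user needs to:

Convert any operations that require output requantization (and thus have additional parameters) from functionals to module form (for example, using torch.nn.ReLU instead of torch.nn.functional.relu).
Specify which parts of the model need to be quantized, either by assigning .qconfig attributes on submodules or by specifying qconfig_mapping. For example, setting model.conv1.qconfig = None means that the model.conv layer will not be quantized, and setting model.linear1.qconfig = custom_qconfig means that the quantization settings for model.linear1 will use custom_qconfig instead of the global qconfig.

For static quantization techniques which quantize activations, the user needs to do the following in addition:

Specify where activations are quantized and de-quantized. This is done using QuantStub and DeQuantStub modules.
Use FloatFunctional to wrap tensor operations that require special handling for quantization into modules. Examples are operations like add and cat, which require special handling to determine output quantization parameters.
Fuse modules: combine operations/modules into a single module to obtain higher accuracy and performance. This is done using the fuse_modules() API, which takes in lists of modules to be fused. We currently support the following fusions: [Conv, Relu], [Conv, BatchNorm], [Conv, BatchNorm, Relu], [Linear, Relu]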
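As a small sketch of the module-form and FloatFunctional requirements, the hypothetical `ResidualBlock` below uses `torch.nn.ReLU` rather than the functional form and wraps its skip-connection addition in `torch.ao.nn.quantized.FloatFunctional`. Before any quantization is applied, the wrapper behaves like a plain `+`, which the last lines check.

```python
import torch
from torch.ao.nn.quantized import FloatFunctional

class ResidualBlock(torch.nn.Module):
    """Hypothetical block: conv + ReLU with a skip connection."""

    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        # Module form instead of torch.nn.functional.relu.
        self.relu = torch.nn.ReLU()
        # Module wrapper for `+`, so an observer can be attached to the sum.
        self.skip_add = FloatFunctional()

    def forward(self, x):
        out = self.relu(self.conv(x))
        return self.skip_add.add(out, x)

block = ResidualBlock().eval()
x = torch.randn(2, 1, 4, 4)
with torch.no_grad():
    y = block(x)
    # Reference computed with the plain floating point operator.
    reference = torch.relu(block.conv(x)) + x
print(torch.allclose(y, reference))
```

For static quantization the block would still need QuantStub/DeQuantStub placement and a qconfig, as in the PTSQ example earlier.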

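(Prototype - maintenance mode) FX Graph Mode Quantization¶

There are multiple quantization types in post training quantization (weight only, dynamic and static), and the configuration is done through qconfig_mapping (an argument of the prepare_fx function).

FXPTQ API Example: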

import torch
from torch.ao.quantization import (
  get_default_qconfig_mapping,
  get_default_qat_qconfig_mapping,
  QConfigMapping,
)
import torch.ao.quantization.quantize_fx as quantize_fx
import copy

# UserModel is a placeholder for a user-defined torch.nn.Module
model_fp = UserModel()

#
# post training dynamic/weight_only quantization
#

# we need to deepcopy if we still want to keep model_fp unchanged after quantization since quantization apis change the input model
model_to_quantize = copy.deepcopy(model_fp)
model_to_quantize.eval()
qconfig_mapping = QConfigMapping().set_global(torch.ao.quantization.default_dynamic_qconfig)
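# a tuple of one or more example inputs is needed to trace the model
# (input_fp32 is a placeholder for a representative input tensor;
# note the trailing comma, which makes this a one-element tuple)
example_inputs = (input_fp32,)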
# prepare
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
# no calibration needed when we only have dynamic/weight_only quantization
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# post training static quantization
#

model_to_quantize = copy.deepcopy(model_fp)
qconfig_mapping = get_default_qconfig_mapping("qnnpack")
model_to_quantize.eval()
# prepare
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
# calibrate (not shown)
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# quantization aware training for static quantization
#

model_to_quantize = copy.deepcopy(model_fp)
qconfig_mapping = get_default_qat_qconfig_mapping("qnnpack")
model_to_quantize.train()
# prepare
model_prepared = quantize_fx.prepare_qat_fx(model_to_quantize, qconfig_mapping, example_inputs)
# training loop (not shown)
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# fusion
#
model_to_quantize = copy.deepcopy(model_fp)
model_fused = quantize_fx.fuse_fx(model_to_quantize)

Please see the FX Graph Mode Quantization tutorials to learn more.

(Prototype) PyTorch 2 Export Quantization¶

API Example:

import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.linear(x)

# initialize a floating point model
float_model = M().eval()

# define calibration function
def calibrate(model, data_loader):
    model.eval()
    with torch.no_grad():
        for image, target in data_loader:
            model(image)

# Step 1. program capture
# NOTE: this API will be updated to torch.export API in the future, but the captured
# result should mostly stay the same
example_inputs = (torch.randn(1, 5),)  # a tuple of example inputs for M
m = capture_pre_autograd_graph(float_model, example_inputs)
# we get a model with aten ops

# Step 2. quantization
# backend developer will write their own Quantizer and expose methods to allow
# users to express how they
# want the model to be quantized
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
# or prepare_qat_pt2e for Quantization Aware Training
m = prepare_pt2e(m, quantizer)

# run calibration
# calibrate(m, sample_inference_data)
m = convert_pt2e(m)

# Step 3. lowering
# lower to target backend

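Please follow these tutorials to get started on PyTorch 2 Export Quantization:

Modeling Users:

Backend Developers (please check out all Modeling Users docs as well):

How to Write a Quantizer for PyTorch 2 Export Quantization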

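Quantization Stack¶

Quantization is the process of converting a floating point model to a quantized model. So at a high level the quantization stack can be split into two parts: 1) the building blocks or abstractions for a quantized model; 2) the building blocks or abstractions for the quantization flow that converts a floating point model to a quantized model.

Quantized Model¶

Quantized Tensor¶

In order to do quantization in PyTorch, we need to be able to represent quantized data in Tensors. A Quantized Tensor allows for storing quantized data (represented as int8/uint8/int32) along with quantization parameters like scale and zero_point. Quantized Tensors allow for many useful operations that make quantized arithmetic easy, in addition to allowing for serialization of data in a quantized format.

PyTorch supports both per tensor and per channel symmetric and asymmetric quantization. Per tensor means that all the values within the tensor are quantized the same way with the same quantization parameters. Per channel means that for each dimension, typically the channel dimension of a tensor, the values in the tensor are quantized with different quantization parameters. This allows for less error in converting tensors to quantized values, since outlier values only impact the channel they are in instead of the entire Tensor.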
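The effect of an outlier can be seen with a small pure-Python sketch. It uses symmetric quantization to a signed 8-bit range and made-up values for two channels; the helper names are hypothetical.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` onto qmax."""
    return max(abs(v) for v in values) / qmax

def roundtrip_error(values, scale, qmax=127):
    """Largest absolute error after quantizing and dequantizing with `scale`."""
    worst = 0.0
    for v in values:
        q = max(-qmax - 1, min(qmax, round(v / scale)))
        worst = max(worst, abs(q * scale - v))
    return worst

channels = [
    [0.011, -0.023, 0.017],   # channel with small values
    [12.7, -6.35, 3.0],       # channel containing an outlier
]

# Per tensor: one scale for everything, dominated by the outlier 12.7.
flat = [v for channel in channels for v in channel]
per_tensor_scale = symmetric_scale(flat)
per_tensor_err_small = roundtrip_error(channels[0], per_tensor_scale)

# Per channel: the small channel gets its own, much finer scale.
per_channel_err_small = roundtrip_error(channels[0], symmetric_scale(channels[0]))

print(per_tensor_err_small, per_channel_err_small)
```

With a single scale, every value in the small channel rounds to zero; with its own scale, the same channel is reproduced to within half a quantization step.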

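The mapping is performed by converting the floating point tensors using

$Q(x, \text{scale}, \text{zero\_point}) = \text{round}\left(\frac{x}{\text{scale}} + \text{zero\_point}\right)$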

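Note that we ensure that zero in floating point is represented with no error after quantization, thereby ensuring that operations like padding do not cause additional quantization error.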
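A minimal sketch of the affine mapping, round(x / scale + zero_point) clamped to the integer range, together with its inverse, shows both the rounding error on ordinary values and the exact representation of zero. The scale and zero point below are arbitrary illustration values.

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """round(x / scale + zero_point), clamped to [qmin, qmax]."""
    return max(qmin, min(qmax, round(x / scale + zero_point)))

def dequantize(q, scale, zero_point):
    """Inverse mapping back to floating point."""
    return (q - zero_point) * scale

# Hypothetical quantization parameters.
scale, zero_point = 0.05, 10

# 0.0 maps exactly onto the zero point and back to 0.0.
q_zero = quantize(0.0, scale, zero_point)
restored_zero = dequantize(q_zero, scale, zero_point)

# Other values come back with an error of at most half a step (scale / 2).
q_val = quantize(1.23, scale, zero_point)
restored_val = dequantize(q_val, scale, zero_point)

print(q_zero, restored_zero, q_val, restored_val)
```

Because zero_point is an integer, a zero-padded region dequantizes to exactly 0.0 regardless of the scale.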

Here are a few key attributes of a quantized Tensor:

QScheme (torch.qscheme): an enum that specifies the way we quantize the Tensor
- torch.per_tensor_affine
- torch.per_tensor_symmetric
- torch.per_channel_affine
- torch.per_channel_symmetric
dtype (torch.dtype): data type of the quantized Tensor
- torch.quint8
- torch.qint8
- torch.qint32
- torch.float16
quantization parameters (vary based on QScheme): parameters for the chosen way of quantization
- torch.per_tensor_affine has quantization parameters of
  - scale (float)
  - zero_point (int)
- torch.per_channel_affine has quantization parameters of
  - per_channel_scales (list of float)
  - per_channel_zero_points (list of int)
  - axis (int)

Quantize and Dequantize¶

The input and output of a model are floating point Tensors, but activations in the quantized model are quantized, so we need operators to convert between floating point and quantized Tensors.

Quantize (float -> quantized)
- torch.quantize_per_tensor(x, scale, zero_point, dtype)
- torch.quantize_per_channel(x, scales, zero_points, axis, dtype)
- torch.quantize_per_tensor_dynamic(x, dtype, reduce_range)
- to(torch.float16)
Dequantize (quantized -> float)
- quantized_tensor.dequantize() - calling dequantize on a torch.float16 Tensor will convert the Tensor back to torch.float
- torch.dequantize(x)
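A short round trip through these operators, with scale and zero point chosen arbitrarily for illustration, looks roughly like this:

```python
import torch

x = torch.tensor([-1.0, 0.0, 0.5, 1.0])

# float -> quantized, using illustrative per-tensor parameters
xq = torch.quantize_per_tensor(x, scale=0.1, zero_point=10, dtype=torch.quint8)

print(xq.qscheme())                      # quantization scheme of the tensor
print(xq.q_scale(), xq.q_zero_point())   # quantization parameters
print(xq.int_repr())                     # underlying uint8 values

# quantized -> float; torch.dequantize(xq) is equivalent
x_restored = xq.dequantize()
print(x_restored)
```

`int_repr()` exposes the stored integers, which is often the quickest way to check that a scale and zero point behave as intended.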

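Quantized Operators/Modules¶

A Quantized Operator is an operator that takes quantized Tensors as inputs and outputs a quantized Tensor.
Quantized Modules are PyTorch Modules that perform quantized operations. They are typically defined for weighted operations like linear and conv.

Quantized Engine¶

When a quantized model is executed, the qengine (torch.backends.quantized.engine) specifies which backend is to be used for execution. It is important to ensure that the qengine is compatible with the quantized model in terms of the value ranges of quantized activations and weights.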

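Quantization Flow¶

Observer and FakeQuantize¶

Observers are PyTorch Modules used to:
- collect tensor statistics, such as the min and max values of the Tensor passing through the observer
- calculate quantization parameters based on the collected tensor statistics
FakeQuantize modules are PyTorch Modules used to:
- simulate quantization (performing quantize/dequantize) for a Tensor in the network
- calculate quantization parameters based on the statistics collected from an observer, or learn the quantization parameters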
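The two observer responsibilities listed above can be sketched with a toy min/max observer in plain Python. `MinMaxObserverSketch` is a hypothetical stand-in written for illustration; the real observers are PyTorch modules operating on tensors.

```python
class MinMaxObserverSketch:
    """Toy observer: tracks the running min/max of the values it sees."""

    def __init__(self, quant_min=0, quant_max=255):
        self.quant_min, self.quant_max = quant_min, quant_max
        self.min_val, self.max_val = float("inf"), float("-inf")

    def observe(self, values):
        # Step 1: collect statistics from the data passing through.
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))

    def calculate_qparams(self):
        # Step 2: turn the statistics into a scale and zero point.
        # The range is extended to include 0.0 so zero is exactly representable.
        lo, hi = min(self.min_val, 0.0), max(self.max_val, 0.0)
        scale = (hi - lo) / (self.quant_max - self.quant_min) or 1.0
        zero_point = self.quant_min - round(lo / scale)
        return scale, max(self.quant_min, min(self.quant_max, zero_point))

observer = MinMaxObserverSketch()
# Made-up calibration batches.
for batch in ([0.2, 1.5, -0.5], [2.0, 0.1], [-1.0, 0.7]):
    observer.observe(batch)

scale, zero_point = observer.calculate_qparams()
print(scale, zero_point)
```

A FakeQuantize module would combine such an observer with a quantize/dequantize step applied to the values it passes through.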

QConfig¶

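QConfig is a namedtuple of Observer or FakeQuantize Module classes that can be configured with qscheme, dtype etc. It is used to configure how an operator should be observed
- Quantization configuration for an operator/module
  - different types of Observer/FakeQuantize
  - dtype
  - qscheme
  - quant_min/quant_max: can be used to simulate lower precision Tensors
- Currently supports configuration for activation and weight
- We insert input/weight/output observers based on the qconfig that is configured for a given operator or module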

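General Quantization Flow¶

In general, the flow is the following

prepare
- insert Observer/FakeQuantize modules based on the user specified qconfig
calibrate/train (depending on post training quantization or quantization aware training)
- allow Observers to collect statistics or FakeQuantize modules to learn the quantization parameters
convert
- convert a calibrated/trained model to a quantized model

There are different modes of quantization, and they can be classified in two ways:

In terms of where we apply the quantization flow, we have:

Post Training Quantization (apply quantization after training; quantization parameters are calculated based on sample calibration data)
Quantization Aware Training (simulate quantization during training so that the quantization parameters can be learned together with the model using training data)

In terms of how we quantize the operators, we can have:

Weight Only Quantization (only the weight is statically quantized)
Dynamic Quantization (the weight is statically quantized, the activation is dynamically quantized)
Static Quantization (both weight and activations are statically quantized)

We can mix different ways of quantizing operators in the same quantization flow. For example, we can have post training quantization that has both statically and dynamically quantized operators.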

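Quantization Support Matrix¶

Quantization Mode Support¶

	Quantization Mode		Dataset Requirement	Works Best For	Accuracy	Notes
Post Training Quantization	Dynamic/Weight Only Quantization	activation dynamically quantized (fp16, int8) or not quantized, weight statically quantized (fp16, int8, int4)	None	LSTM, MLP, Embedding, Transformer	good	Easy to use, close to static quantization when performance is compute or memory bound due to weights
Post Training Quantization	Static Quantization	activation and weights statically quantized (int8)	calibration dataset	CNN	good	Provides best perf, may have a big impact on accuracy, good for hardware that only supports int8 computation
Quantization Aware Training	Dynamic Quantization	activation and weight are fake quantized	fine-tuning dataset	MLP, Embedding	best	Limited support for now
Quantization Aware Training	Static Quantization	activation and weight are fake quantized	fine-tuning dataset	CNN, MLP, Embedding	best	Typically used when static quantization leads to bad accuracy, and used to close the accuracy gap

Please see our Introduction to Quantization on PyTorch blog post for a more comprehensive overview of the tradeoffs between these quantization types.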

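Quantization Flow Support¶

PyTorch provides two modes of quantization: Eager Mode Quantization and FX Graph Mode Quantization.

Eager Mode Quantization is a beta feature. Users need to do fusion and specify where quantization and dequantization happen manually. It also only supports modules, not functionals.

FX Graph Mode Quantization is an automated quantization framework in PyTorch, and currently it is a prototype feature. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although people might need to refactor the model to make it compatible with FX Graph Mode Quantization (symbolically traceable with torch.fx). Note that FX Graph Mode Quantization is not expected to work on arbitrary models, since a model might not be symbolically traceable. We will integrate it into domain libraries like torchvision, and users will be able to quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we will provide general guidelines, but to actually make it work, users might need to be familiar with torch.fx, especially on how to make a model symbolically traceable.

New users of quantization are encouraged to try out FX Graph Mode Quantization first. If it does not work, users may try to follow the guidelines for using FX Graph Mode Quantization or fall back to Eager Mode Quantization.

The following table compares the differences between Eager Mode Quantization and FX Graph Mode Quantization: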

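	Eager Mode Quantization	FX Graph Mode Quantization
Release Status	beta	prototype
Operator Fusion	Manual	Automatic
Quant/DeQuant Placement	Manual	Automatic
Quantizing Modules	Supported	Supported
Quantizing Functionals/Torch Ops	Manual	Automatic
Support for Customization	Limited Support	Fully Supported
Quantization Mode Support	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static
Input/Output Model Type	`torch.nn.Module`	`torch.nn.Module` (may need some refactoring to make the model compatible with FX Graph Mode Quantization)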

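Backend/Hardware Support¶

Hardware	Kernel Library	Eager Mode Quantization	FX Graph Mode Quantization	Quantization Mode Support
server CPU	fbgemm/onednn	Supported	Supported	All Supported
mobile CPU	qnnpack/xnnpack	Supported	Supported	All Supported
server GPU	TensorRT (early prototype)	Not supported (this requires a graph)	Supported	Static Quantization

Today, PyTorch supports the following backends for running quantized operators efficiently:

x86 CPUs with AVX2 support or higher (without AVX2 some operations have inefficient implementations), via x86 optimized by fbgemm and onednn (see the details in the RFC)
ARM CPUs (typically found in mobile/embedded devices), via qnnpack
(early prototype) support for NVidia GPUs via TensorRT through fx2trt (to be open sourced)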

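Note for native CPU backends¶

We expose both x86 and qnnpack with the same native pytorch quantized operators, so we need an additional flag to distinguish between them. The corresponding implementation of x86 and qnnpack is chosen automatically based on the PyTorch build mode, though users have the option to override this by setting torch.backends.quantized.engine to x86 or qnnpack.

When preparing a quantized model, it is necessary to ensure that the qconfig and the engine used for quantized computations match the backend on which the model will be executed. The qconfig controls the type of observers used during the quantization passes. The qengine controls whether the x86 or qnnpack specific packing function is used when packing weights for linear and convolution functions and modules. For example:

Default settings for x86: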

# set the qconfig for PTQ
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default on x86 CPUs
qconfig = torch.ao.quantization.get_default_qconfig('x86')
# or, set the qconfig for QAT
qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')
# set the qengine to control weight packing
torch.backends.quantized.engine = 'x86'

Default settings for qnnpack:

# set the qconfig for PTQ
qconfig = torch.ao.quantization.get_default_qconfig('qnnpack')
# or, set the qconfig for QAT
qconfig = torch.ao.quantization.get_default_qat_qconfig('qnnpack')
# set the qengine to control weight packing
torch.backends.quantized.engine = 'qnnpack'

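Operator Support¶

Operator coverage varies between dynamic and static quantization and is captured in the table below. Note that for FX Graph Mode Quantization, the corresponding functionals are also supported.

	Static Quantization	Dynamic Quantization
nn.Linear	Y	Y
nn.Conv1d/2d/3d	Y	N
nn.LSTM	N	Y
nn.GRU	N	Y
nn.RNNCell	N	Y
nn.GRUCell	N	Y
nn.LSTMCell	N	Y
nn.EmbeddingBag	Y (activations are in FP32)	Y
nn.Embedding	Y	Y
nn.MultiheadAttention	Not supported	Not supported
Activations	Broadly supported	Unchanged, computations stay in FP32

Note: this will be updated with some information generated from the native backend_config_dict soon.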

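Quantization API Reference¶

The Quantization API Reference contains documentation of quantization APIs, such as quantization passes, quantized tensor operations, and supported quantized modules and functions.

Quantization Backend Configuration¶

The Quantization Backend Configuration contains documentation on how to configure the quantization workflows for various backends.

Quantization Accuracy Debugging¶

The Quantization Accuracy Debugging contains documentation on how to debug quantization accuracy.

Quantization Customizations¶

While default implementations of observers to select the scale factor and bias based on observed tensor data are provided, developers can provide their own quantization functions. Quantization can be applied selectively to different parts of the model or configured differently for different parts of the model.

We also provide support for per channel quantization for conv1d(), conv2d(), conv3d() and linear().

Quantization workflows work by adding (e.g. adding observers as .observer submodules) or replacing (e.g. converting nn.Conv2d to nn.quantized.Conv2d) submodules in the model's module hierarchy. This means that the model stays a regular nn.Module-based instance throughout the process and thus can work with the rest of the PyTorch APIs.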

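Quantization Custom Module API¶

Both Eager mode and FX graph mode quantization APIs provide a hook for the user to specify a module quantized in a custom way, with user defined logic for observation and quantization. The user needs to specify:

1. The Python type of the source fp32 module (existing in the model)
2. The Python type of the observed module (provided by the user). This module needs to define a from_float function which defines how the observed module is created from the original fp32 module.
3. The Python type of the quantized module (provided by the user). This module needs to define a from_observed function which defines how the quantized module is created from the observed module.
4. A configuration describing (1), (2), (3) above, passed to the quantization APIs.

The framework will then do the following:

1. During the prepare module swaps, it will convert every module of the type specified in (1) to the type specified in (2), using the from_float function of the class in (2).
2. During the convert module swaps, it will convert every module of the type specified in (2) to the type specified in (3), using the from_observed function of the class in (3).

Currently, there is a requirement that ObservedCustomModule has a single Tensor output, and an observer will be added by the framework (not by the user) on that output. The observer will be stored under the activation_post_process key as an attribute of the custom module instance. Relaxing these restrictions may be done at a future time.

Custom API Example: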

import torch
import torch.ao.nn.quantized as nnq
from torch.ao.quantization import QConfigMapping
import torch.ao.quantization.quantize_fx

# original fp32 module to replace
class CustomModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(3, 3)

    def forward(self, x):
        return self.linear(x)

# custom observed module, provided by user
class ObservedCustomModule(torch.nn.Module):
    def __init__(self, linear):
        super().__init__()
        self.linear = linear

    def forward(self, x):
        return self.linear(x)

    @classmethod
    def from_float(cls, float_module):
        assert hasattr(float_module, 'qconfig')
        observed = cls(float_module.linear)
        observed.qconfig = float_module.qconfig
        return observed

# custom quantized module, provided by user
class StaticQuantCustomModule(torch.nn.Module):
    def __init__(self, linear):
        super().__init__()
        self.linear = linear

    def forward(self, x):
        return self.linear(x)

    @classmethod
    def from_observed(cls, observed_module):
        assert hasattr(observed_module, 'qconfig')
        assert hasattr(observed_module, 'activation_post_process')
        observed_module.linear.activation_post_process = \
            observed_module.activation_post_process
        quantized = cls(nnq.Linear.from_float(observed_module.linear))
        return quantized

#
# example API call (Eager mode quantization)
#

m = torch.nn.Sequential(CustomModule()).eval()
prepare_custom_config_dict = {
    "float_to_observed_custom_module_class": {
        CustomModule: ObservedCustomModule
    }
}
convert_custom_config_dict = {
    "observed_to_quantized_custom_module_class": {
        ObservedCustomModule: StaticQuantCustomModule
    }
}
m.qconfig = torch.ao.quantization.default_qconfig
mp = torch.ao.quantization.prepare(
    m, prepare_custom_config_dict=prepare_custom_config_dict)
# calibration (not shown)
mq = torch.ao.quantization.convert(
    mp, convert_custom_config_dict=convert_custom_config_dict)
#
# example API call (FX graph mode quantization)
#
m = torch.nn.Sequential(CustomModule()).eval()
qconfig_mapping = QConfigMapping().set_global(torch.ao.quantization.default_qconfig)
prepare_custom_config_dict = {
    "float_to_observed_custom_module_class": {
        "static": {
            CustomModule: ObservedCustomModule,
        }
    }
}
convert_custom_config_dict = {
    "observed_to_quantized_custom_module_class": {
        "static": {
            ObservedCustomModule: StaticQuantCustomModule,
        }
    }
}
mp = torch.ao.quantization.quantize_fx.prepare_fx(
    m, qconfig_mapping, (torch.randn(3, 3),), prepare_custom_config=prepare_custom_config_dict)
# calibration (not shown)
mq = torch.ao.quantization.quantize_fx.convert_fx(
    mp, convert_custom_config=convert_custom_config_dict)

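Best Practices¶

1. If you are using the x86 backend, we need to use 7 bits instead of 8 bits. Make sure you reduce the range for quant_min and quant_max: if dtype is torch.quint8, set a custom quant_min of 0 and quant_max of 127 (255 / 2); if dtype is torch.qint8, set a custom quant_min of -64 (-128 / 2) and quant_max of 63 (127 / 2). This is already set correctly if you call the torch.ao.quantization.get_default_qconfig(backend) or torch.ao.quantization.get_default_qat_qconfig(backend) function to get the default qconfig for the x86 or qnnpack backend.

2. If the onednn backend is chosen, 8 bits for activation will be used in the default qconfig mapping torch.ao.quantization.get_default_qconfig_mapping('onednn') and the default qconfig torch.ao.quantization.get_default_qconfig('onednn'). It is recommended to use it on CPUs with Vector Neural Network Instruction (VNNI) support. Otherwise, set reduce_range to True for the activation observer to get better accuracy on CPUs without VNNI support.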
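The reduced ranges quoted in the first item follow from dropping one bit of precision. A small helper, written here only to show the arithmetic (`quant_range` is a hypothetical name, not a PyTorch function), reproduces them:

```python
def quant_range(bits, signed, reduce_range=False):
    """Integer range for `bits`-wide values; reduce_range drops one bit."""
    if reduce_range:
        bits -= 1
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

quint8_full = quant_range(8, signed=False)
quint8_reduced = quant_range(8, signed=False, reduce_range=True)
qint8_full = quant_range(8, signed=True)
qint8_reduced = quant_range(8, signed=True, reduce_range=True)

print("quint8:", quint8_full, "->", quint8_reduced)
print("qint8: ", qint8_full, "->", qint8_reduced)
```

These are the quant_min/quant_max pairs to use when building a custom qconfig for a backend that needs the 7-bit range.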

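Frequently Asked Questions¶

How can I do quantized inference on GPU?

We don't have official GPU support yet, but this is an area of active development; you can find more information here.
Where can I get ONNX support for my quantized model?

If you get errors exporting the model (using APIs under torch.onnx), you may open an issue in the PyTorch repository. Prefix the issue title with [ONNX] and tag the issue as module: onnx.

If you encounter issues with ONNX Runtime, open an issue at GitHub - microsoft/onnxruntime.
How can I use quantization with LSTMs?

LSTM is supported through our custom module API in both Eager mode and FX graph mode quantization. Examples can be found at Eager mode: pytorch/test_quantized_op.py TestQuantizedOps.test_custom_module_lstm; FX graph mode: pytorch/test_quantize_fx.py TestQuantizeFx.test_static_lstm

Common Errors¶

Passing a non-quantized Tensor into a quantized kernel¶

If you see an error similar to: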

RuntimeError: Could not run 'quantized::some_operator' with arguments from the 'CPU' backend...

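This means that you are trying to pass a non-quantized Tensor to a quantized kernel. A common workaround is to use torch.ao.quantization.QuantStub to quantize the tensor. This needs to be done manually in Eager mode quantization. An e2e example: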

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)

    def forward(self, x):
        # during the convert step, this will be replaced with a
        # `quantize_per_tensor` call
        x = self.quant(x)
        x = self.conv(x)
        return x

Passing a quantized Tensor into a non-quantized kernel¶