（测试版）BERT 上的动态量化¶

创建时间： 2019-12-06 |上次更新时间：2024 年 8 月 29 日 |上次验证： Nov 05， 2024

提示

为了充分利用本教程，我们建议使用此 Colab 版本。这将允许您试验下面提供的信息。

作者： Jianyu Huang

测评者： Raghuraman Krishnamoorthi

编辑：Jessica Lin

介绍¶

在本教程中，我们将在 BERT 上应用动态量化模型，紧随 HuggingFace 的 BERT 模型变压器示例。通过此分步旅程，我们想演示如何将 BERT 等众所周知的先进模型转换为动态模型量化模型。

BERT 或 Bidirectional Embedding Representations from Transformers，是一种预训练语言表示的新方法，它在许多常用的自然语言处理（NLP）任务，例如问答、文本分类等。原始论文可以在这里找到。
PyTorch 中的动态量化支持将浮点模型转换为具有静态 int8 或 float16 数据类型的量化模型，用于 WEIGHTS 和 Dynamic Quantization 的 Activation。激活当权重为量化为 int8。在 PyTorch 中，我们有 torch.quantization.quantize_dynamic API，它将指定的模块替换为仅动态权重量化版本并输出量化模型。
我们在一般语言理解评估基准（GLUE）中演示了 Microsoft Research Paraphrase Corpus （MRPC）任务的准确性和推理性能结果。MRPC（Dolan 和 Brockett，2005 年）是从在线新闻中自动提取的句子对语料库 sources 的 Sources 中，并带有对句子是否对在语义上是等效的。由于职业不平衡（68%）阳性，32% 阴性），我们遵循常见做法并报告 F1 评分。 MRPC 是语言对分类的常见 NLP 任务，如图所示下面。

1. 设置¶

1.1 安装 PyTorch 和 HuggingFace 变压器¶

要开始本教程，我们首先按照安装说明进行作在 PyTorch 中这里和 HuggingFace Github Repo 中这里。此外，我们还安装了 scikit-learn 包，因为我们将重用它的内置 F1 分数计算辅助函数。

pip install sklearn
pip install transformers==4.29.2

因为我们将使用 PyTorch 的 beta 部分，所以它是建议安装最新版本的 Torch 和 TorchVision。你可以在此处找到有关本地安装的最新说明。例如，要在苹果电脑：

yes y | pip uninstall torch torchvision
yes y | pip install --pre torch -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html

1.2 导入必要的模块¶

在此步骤中，我们将导入本教程所需的 Python 模块。

import logging
import numpy as np
import os
import random
import sys
import time
import torch

from argparse import Namespace
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
from tqdm import tqdm
from transformers import (BertConfig, BertForSequenceClassification, BertTokenizer,)
from transformers import glue_compute_metrics as compute_metrics
from transformers import glue_output_modes as output_modes
from transformers import glue_processors as processors
from transformers import glue_convert_examples_to_features as convert_examples_to_features

# Setup logging
logger = logging.getLogger(__name__)
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.WARN)

logging.getLogger("transformers.modeling_utils").setLevel(
   logging.WARN)  # Reduce logging

print(torch.__version__)

我们设置线程数来比较 FP32 和 INT8 性能之间的单线程性能。在本教程的最后，用户可以通过构建具有正确并行后端的 PyTorch 来设置其他数量的线程。

torch.set_num_threads(1)
print(torch.__config__.parallel_info())

1.3 了解辅助函数¶

辅助函数内置在 transformers 库中。我们主要使用以下辅助函数：一个用于转换文本示例到特征向量中;另一个用于测量 F1 分数预测结果。

glue_convert_examples_to_features 函数将文本转换为输入特征：

对输入序列进行标记化;
在开头插入 [CLS];
在第一句和第二句之间插入 [SEP]，以及最后;
生成令牌类型 ID 以指示令牌是否属于第一个序列或第二个序列。

glue_compute_metrics 函数具有 F1 分数，其中可以解释为 precision 和 recall 的加权平均值，其中 F1 分数在 1 处达到其最佳值，在 0 处达到最差分数。这精确率和召回率对 F1 分数的相对贡献相等。

F1 分数的公式为：

\[F1 = 2 * （\text{precision} * \text{recall}） / （\text{precision} + \text{recall}） \]

1.4 下载数据集¶

在运行 MRPC 任务之前，我们通过运行此脚本下载 GLUE 数据并将其解压缩到目录 .glue_data

python download_glue_data.py --data_dir='glue_data' --tasks='MRPC'

2. 微调 BERT 模型¶

BERT 的精神是预先训练语言表示，然后在广泛的任务，并实现最先进的结果。在本教程中，我们将重点介绍微调使用预先训练的 BERT 模型对语义等效进行分类 MRPC 任务上的句子对。

要微调预先训练的 BERT 模型（ HuggingFace transformers）执行 MRPC 任务，您可以按照命令例如：bert-base-uncased

export GLUE_DIR=./glue_data
export TASK_NAME=MRPC
export OUT_DIR=./$TASK_NAME/
python ./run_glue.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=8   \
    --per_gpu_train_batch_size=8   \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --save_steps 100000 \
    --output_dir $OUT_DIR

我们在此处为 MRPC 任务提供微调的 BERT 模型。为了节省时间，您可以将模型文件（~400 MB）直接下载到本地文件夹中。$OUT_DIR

2.1 设置全局配置¶

在这里，我们设置用于评估微调 BERT 的全局配置动态量化前后的模型。

configs = Namespace()

# The output directory for the fine-tuned model, $OUT_DIR.
configs.output_dir = "./MRPC/"

# The data directory for the MRPC task in the GLUE benchmark, $GLUE_DIR/$TASK_NAME.
configs.data_dir = "./glue_data/MRPC"

# The model name or path for the pre-trained model.
configs.model_name_or_path = "bert-base-uncased"
# The maximum length of an input sequence
configs.max_seq_length = 128

# Prepare GLUE task.
configs.task_name = "MRPC".lower()
configs.processor = processors[configs.task_name]()
configs.output_mode = output_modes[configs.task_name]
configs.label_list = configs.processor.get_labels()
configs.model_type = "bert".lower()
configs.do_lower_case = True

# Set the device, batch size, topology, and caching flags.
configs.device = "cpu"
configs.per_gpu_eval_batch_size = 8
configs.n_gpu = 0
configs.local_rank = -1
configs.overwrite_cache = False


# Set random seed for reproducibility.
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
set_seed(42)

2.2 加载微调后的 BERT 模型¶

我们加载分词器和微调的 BERT 序列分类器模型（FP32）从 .configs.output_dir

tokenizer = BertTokenizer.from_pretrained(
    configs.output_dir, do_lower_case=configs.do_lower_case)

model = BertForSequenceClassification.from_pretrained(configs.output_dir)
model.to(configs.device)

2.3 定义 tokenize 和 evaluation 函数¶

我们重用了 HuggingFace 的 tokenize 和 evaluation 函数。

# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

def evaluate(args, model, tokenizer, prefix=""):
    # Loop to handle MNLI double evaluation (matched, mis-matched)
    eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
    eval_outputs_dirs = (args.output_dir, args.output_dir + '-MM') if args.task_name == "mnli" else (args.output_dir,)

    results = {}
    for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
        eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)

        if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
            os.makedirs(eval_output_dir)

        args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
        # Note that DistributedSampler samples randomly
        eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
        eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)

        # multi-gpu eval
        if args.n_gpu > 1:
            model = torch.nn.DataParallel(model)

        # Eval!
        logger.info("***** Running evaluation {} *****".format(prefix))
        logger.info("  Num examples = %d", len(eval_dataset))
        logger.info("  Batch size = %d", args.eval_batch_size)
        eval_loss = 0.0
        nb_eval_steps = 0
        preds = None
        out_label_ids = None
        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            model.eval()
            batch = tuple(t.to(args.device) for t in batch)

            with torch.no_grad():
                inputs = {'input_ids':      batch[0],
                          'attention_mask': batch[1],
                          'labels':         batch[3]}
                if args.model_type != 'distilbert':
                    inputs['token_type_ids'] = batch[2] if args.model_type in ['bert', 'xlnet'] else None  # XLM, DistilBERT and RoBERTa don't use segment_ids
                outputs = model(**inputs)
                tmp_eval_loss, logits = outputs[:2]

                eval_loss += tmp_eval_loss.mean().item()
            nb_eval_steps += 1
            if preds is None:
                preds = logits.detach().cpu().numpy()
                out_label_ids = inputs['labels'].detach().cpu().numpy()
            else:
                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
                out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)

        eval_loss = eval_loss / nb_eval_steps
        if args.output_mode == "classification":
            preds = np.argmax(preds, axis=1)
        elif args.output_mode == "regression":
            preds = np.squeeze(preds)
        result = compute_metrics(eval_task, preds, out_label_ids)
        results.update(result)

        output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results {} *****".format(prefix))
            for key in sorted(result.keys()):
                logger.info("  %s = %s", key, str(result[key]))
                writer.write("%s = %s\n" % (key, str(result[key])))

    return results


def load_and_cache_examples(args, task, tokenizer, evaluate=False):
    if args.local_rank not in [-1, 0] and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

    processor = processors[task]()
    output_mode = output_modes[task]
    # Load data features from cache or dataset file
    cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}_{}'.format(
        'dev' if evaluate else 'train',
        list(filter(None, args.model_name_or_path.split('/'))).pop(),
        str(args.max_seq_length),
        str(task)))
    if os.path.exists(cached_features_file) and not args.overwrite_cache:
        logger.info("Loading features from cached file %s", cached_features_file)
        features = torch.load(cached_features_file)
    else:
        logger.info("Creating features from dataset file at %s", args.data_dir)
        label_list = processor.get_labels()
        if task in ['mnli', 'mnli-mm'] and args.model_type in ['roberta']:
            # HACK(label indices are swapped in RoBERTa pretrained model)
            label_list[1], label_list[2] = label_list[2], label_list[1]
        examples = processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
        features = convert_examples_to_features(examples,
                                                tokenizer,
                                                label_list=label_list,
                                                max_length=args.max_seq_length,
                                                output_mode=output_mode,
                                                pad_on_left=bool(args.model_type in ['xlnet']),                 # pad on the left for xlnet
                                                pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
                                                pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0,
        )
        if args.local_rank in [-1, 0]:
            logger.info("Saving features into cached file %s", cached_features_file)
            torch.save(features, cached_features_file)

    if args.local_rank == 0 and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
    all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
    if output_mode == "classification":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
    elif output_mode == "regression":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.float)

    dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
    return dataset

3. 应用动态量化¶

我们呼吁模型申请 HuggingFace BERT 模型的动态量化。具体说来torch.quantization.quantize_dynamic

我们指定希望模型中的 torch.nn.Linear 模块能够被量化;
我们指定希望将权重转换为量化的 int8 值。

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(quantized_model)

3.1 检查模型大小¶

让我们首先检查模型大小。我们可以观察到显著的减少在模型大小（FP32 总大小：438 MB;INT8 总大小：181 MB）：

def print_size_of_model(model):
    torch.save(model.state_dict(), "temp.p")
    print('Size (MB):', os.path.getsize("temp.p")/1e6)
    os.remove('temp.p')

print_size_of_model(model)
print_size_of_model(quantized_model)

本教程（）中使用的 BERT 模型具有词汇量 V 为 30522。当嵌入大小为 768 时，总单词嵌入表的大小为 ~ 4（字节/FP32）* 30522 * 768 = 90 MB。因此，在量化的帮助下，模型大小的非嵌入表格部分从 350 MB（FP32 模型）减少到 90 MB （INT8 型号）。bert-base-uncased

3.2 评估推理精度和时间¶

接下来，我们来比较一下推理时间和评估原始 FP32 模型与 INT8 模型之间的精度动态量化。

def time_model_evaluation(model, configs, tokenizer):
    eval_start_time = time.time()
    result = evaluate(configs, model, tokenizer, prefix="")
    eval_end_time = time.time()
    eval_duration_time = eval_end_time - eval_start_time
    print(result)
    print("Evaluate total time (seconds): {0:.1f}".format(eval_duration_time))

# Evaluate the original FP32 BERT model
time_model_evaluation(model, configs, tokenizer)

# Evaluate the INT8 BERT model after the dynamic quantization
time_model_evaluation(quantized_model, configs, tokenizer)

在 MacBook Pro 上本地运行，无需量化、推理（对于 MRPC 数据集中的所有 408 个示例）大约需要 160 秒，而使用量化它只需要大约 90 秒。我们总结了结果在 Macbook Pro 上运行量化的 BERT 模型推理，作为遵循：

| Prec | F1 score | Model Size | 1 thread | 4 threads |
| FP32 |  0.9019  |   438 MB   | 160 sec  | 85 sec    |
| INT8 |  0.902   |   181 MB   |  90 sec  | 46 sec    |

在应用训练后动态后，我们的 F1 分数准确性降低了 0.6% MRPC 任务上微调的 BERT 模型的量化。作为比较，在最近的一篇论文中（表 1），它达到 0.8788 应用训练后动态量化和 0.8956 量化感知训练。主要区别在于我们支持 PyTorch 中的非对称量化，而该论文支持仅限对称量化。

请注意，我们将单线程的线程数设置为 1 比较。我们还支持内部作这些量化 INT8 运算符的并行化。用户现在可以 set multi-thread by （是内部作并行化线程）。启用作内并行化支持是使用正确的后端（如 OpenMP、Native 或 TBB）构建 PyTorch。您可以使用它来检查并行化设置。在同一台 MacBook Pro 上，使用 PyTorch 和用于并行化的原生后端，我们可以得到大约 46 秒的处理 MRPC 数据集的评估。torch.set_num_threads(N)Ntorch.__config__.parallel_info()

3.3 序列化量化模型¶

在跟踪模型后，我们可以使用 torch.jit.save 序列化并保存量化模型以备将来使用。

def ids_tensor(shape, vocab_size):
    #  Creates a random int32 tensor of the shape within the vocab size
    return torch.randint(0, vocab_size, shape=shape, dtype=torch.int, device='cpu')

input_ids = ids_tensor([8, 128], 2)
token_type_ids = ids_tensor([8, 128], 2)
attention_mask = ids_tensor([8, 128], vocab_size=2)
dummy_input = (input_ids, attention_mask, token_type_ids)
traced_model = torch.jit.trace(quantized_model, dummy_input)
torch.jit.save(traced_model, "bert_traced_eager_quant.pt")

要加载量化模型，我们可以使用 torch.jit.load

loaded_quantized_model = torch.jit.load("bert_traced_eager_quant.pt")

结论¶

在本教程中，我们演示了如何将将 BERT 等知名的最先进的 NLP 模型转换为动态量化型。动态量化可以减小模型的大小，而只有对准确性的影响有限。

感谢阅读！一如既往，我们欢迎任何反馈，因此请创建如果你有任何。

引用¶

[1] J.Devlin、M. Chang、K. Lee 和 K. Toutanova， BERT：预训练用于语言理解的深度双向转换器（2018）。

[2] HuggingFace 变形金刚。

[3] O. Zafrir、G. Boudoukh、P. Izsak 和 M. Wasserblat （2019）。Q8BERT：量化 8 位 BERT。