End-to-End Workflow with torchtune

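In this tutorial, we'll walk through an end-to-end example of how you can fine-tune, evaluate, optionally quantize, and then run generation with your favorite LLM using torchtune. We'll also go over how you can use some popular tools and libraries from the community together with torchtune.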

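This tutorial covers:

The different types of recipes available in torchtune beyond fine-tuning
An end-to-end example connecting all of these recipes
The different tools and libraries you can use with torchtune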

Prerequisites

Overview

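Fine-tuning an LLM is usually only one step in a larger workflow. An example workflow might look like this:

Download a popular model from the HF Hub
Fine-tune the model using a relevant fine-tuning technique. The exact technique depends on factors such as the model, the amount and nature of the training data, your hardware setup, and the end task the model will be used for
Evaluate the model on some benchmarks to validate model quality
Run some generations to make sure the model output looks reasonable
Quantize the model for efficient inference
[Optional] Export the model for specific environments such as inference on a mobile phone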

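In this tutorial, we'll cover how you can use torchtune for all of the above, leveraging integrations with popular tools and libraries from the ecosystem.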

We'll use the Llama2 7B model for this tutorial. The complete set of models supported by torchtune is listed in the torchtune documentation.

Download Llama2 7B

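In this tutorial, we'll use the Hugging Face model weights for the Llama2 7B model. For more information on checkpoint formats and how torchtune handles them, take a look at the tutorial on checkpoints.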

To download the HF-format Llama2 7B model, we'll use the tune CLI.

tune download \
meta-llama/Llama-2-7b-hf \
--output-dir <checkpoint_dir> \
--hf-token <ACCESS TOKEN>

Make a note of <checkpoint_dir>; we'll use it many times in this tutorial.

Fine-tune the model using LoRA

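For this tutorial, we'll fine-tune the model using LoRA. LoRA is a parameter-efficient fine-tuning technique that is especially helpful when you don't have a lot of GPU memory to play with. LoRA freezes the base LLM and adds a very small percentage of learnable parameters. This keeps the memory associated with gradients and optimizer state low. Using torchtune, you should be able to fine-tune a Llama2 7B model with LoRA in less than 16GB of GPU memory using bfloat16 on an RTX 3090/4090. For more information on how to use LoRA, take a look at our LoRA tutorial.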
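To make "a very small percentage of learnable parameters" concrete, here is a back-of-the-envelope count. It is a sketch under assumptions that are not stated in this tutorial: the publicly documented Llama2 7B dimensions (32 layers, hidden size 4096, MLP size 11008, vocabulary 32000), and LoRA with rank 8 applied only to the query and value projections of each attention layer. Check your own config for the actual rank and target modules.

```python
# Back-of-the-envelope LoRA parameter count (a sketch, not torchtune code).
# Assumed Llama2 7B dimensions; these are not taken from this tutorial.
NUM_LAYERS = 32
DIM = 4096
MLP_DIM = 11008
VOCAB = 32000

# Assumed LoRA setup: rank 8 on the q and v projections only.
LORA_RANK = 8
LORA_MODULES_PER_LAYER = 2


def base_param_count() -> int:
    """Count the weights of the frozen base model."""
    attention = 4 * DIM * DIM      # q, k, v and output projections
    mlp = 3 * DIM * MLP_DIM        # gate, up and down projections
    norms = 2 * DIM                # two norm scales per layer
    embeddings = 2 * VOCAB * DIM   # token embedding + output projection
    final_norm = DIM
    return NUM_LAYERS * (attention + mlp + norms) + embeddings + final_norm


def lora_param_count() -> int:
    """Each adapted DIM x DIM matrix gains A (rank x DIM) and B (DIM x rank)."""
    per_module = LORA_RANK * (DIM + DIM)
    return NUM_LAYERS * LORA_MODULES_PER_LAYER * per_module


base = base_param_count()
lora = lora_param_count()
print(f"base parameters:       {base:,}")
print(f"trainable LoRA params: {lora:,} ({100 * lora / base:.3f}% of base)")
print(f"adapter size in bf16:  {lora * 2 / 1e6:.1f} MB")
```

Under these assumptions, well under 0.1% of the parameters need gradients and optimizer state, which is why the training footprint stays close to the size of the frozen bfloat16 weights.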

We'll fine-tune using our single-device LoRA recipe and the standard settings from the default config.

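This will fine-tune our model using batch_size=2 and dtype=bfloat16. With these settings, the model should have a peak memory usage of ~16GB and a total training time of around two hours per epoch. We'll need to make some changes to the config to make sure our recipe can access the right checkpoints.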

Let's look for the right config for this use case using the tune CLI.

tune ls

RECIPE                                   CONFIG
full_finetune_single_device              llama2/7B_full_low_memory
                                         mistral/7B_full_low_memory
full_finetune_distributed                llama2/7B_full
                                         llama2/13B_full
                                         mistral/7B_full
lora_finetune_single_device              llama2/7B_lora_single_device
                                         llama2/7B_qlora_single_device
                                         mistral/7B_lora_single_device
...

For this tutorial, we'll use the llama2/7B_lora_single_device config.

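The config already points to the HF Checkpointer and the right checkpoint files. All we need to do is update the checkpoint directory for both the model and the tokenizer. Let's do this using overrides in the tune CLI when starting training!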

tune run lora_finetune_single_device \
--config llama2/7B_lora_single_device \
checkpointer.checkpoint_dir=<checkpoint_dir> \
tokenizer.path=<checkpoint_dir>/tokenizer.model \
checkpointer.output_dir=<checkpoint_dir>

Once training is complete, you'll see the following in the logs.

[_checkpointer.py:473] Model checkpoint of size 9.98 GB saved to <checkpoint_dir>/hf_model_0001_0.pt

[_checkpointer.py:473] Model checkpoint of size 3.50 GB saved to <checkpoint_dir>/hf_model_0002_0.pt

[_checkpointer.py:484] Adapter checkpoint of size 0.01 GB saved to <checkpoint_dir>/adapter_0.pt

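The final trained weights are merged with the original model and split across two checkpoint files, similar to the source checkpoints from the HF Hub (see the LoRA tutorial for more details). In fact, the keys are identical between these checkpoints. We also have a third checkpoint file, which is much smaller and contains the learned LoRA adapter weights. For this tutorial, we'll only use the model checkpoints and not the adapter weights.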

Run Evaluation using EleutherAI's Eval Harness

We've fine-tuned a model. But how well does this model really do? Let's run some evaluations!

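torchtune integrates with EleutherAI's evaluation harness. An example of this is available through the eleuther_eval recipe. In this tutorial, we're going to use this recipe directly by modifying its associated config, eleuther_evaluation.yaml.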

Note

For this section of the tutorial, you should first run pip install lm_eval==0.4.* to install the EleutherAI evaluation harness.

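Since we plan to update all of the checkpoint files to point to our fine-tuned checkpoints, let's first copy the config to our local working directory so we can make changes. This is easier than overriding all of these elements through the CLI.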

tune cp eleuther_evaluation ./custom_eval_config.yaml

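For this tutorial, we'll use the truthfulqa_mc2 task from the harness. This task measures a model's propensity to be truthful when answering questions: it measures the model's zero-shot accuracy on a question followed by one or more true responses and one or more false responses. Let's first run a baseline without fine-tuning.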

tune run eleuther_eval --config ./custom_eval_config.yaml \
checkpointer.checkpoint_dir=<checkpoint_dir> \
tokenizer.path=<checkpoint_dir>/tokenizer.model

[evaluator.py:324] Running loglikelihood requests
[eleuther_eval.py:195] Eval completed in 121.27 seconds.
[eleuther_eval.py:197] truthfulqa_mc2: {'acc,none': 0.388...

The model has an accuracy of around 38.8%. Let's compare this with the fine-tuned model.

First, we modify custom_eval_config.yaml to include the fine-tuned checkpoints.

checkpointer:
    _component_: torchtune.utils.FullModelHFCheckpointer

    # directory with the checkpoint files
    # this should match the output_dir specified during
    # finetuning
    checkpoint_dir: <checkpoint_dir>

    # checkpoint files for the fine-tuned model. This should
    # match what's shown in the logs above
    checkpoint_files: [
        hf_model_0001_0.pt,
        hf_model_0002_0.pt,
    ]

    output_dir: <checkpoint_dir>
    model_type: LLAMA2

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
    _component_: torchtune.models.llama2.llama2_tokenizer
    path: <checkpoint_dir>/tokenizer.model

Now, let's run the recipe.

tune run eleuther_eval --config ./custom_eval_config.yaml

The results should look something like this.

[evaluator.py:324] Running loglikelihood requests
[eleuther_eval.py:195] Eval completed in 121.27 seconds.
[eleuther_eval.py:197] truthfulqa_mc2: {'acc,none': 0.489 ...

Our fine-tuned model gets ~48.9% on this task, which is ~10 points better than the baseline. Great! It seems our fine-tuning helped.
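For intuition about the accuracy reported above, the multiple-true variant of TruthfulQA is commonly described as the share of probability mass a model places on the true answers. The sketch below illustrates that idea from per-answer log-likelihoods. The function name and inputs are illustrative, and the harness's own implementation may differ in detail.

```python
import math


def mc2_score(ll_true, ll_false):
    """Share of probability mass assigned to the true answers.

    ll_true / ll_false: log-likelihoods the model assigns to each true and
    each false reference answer for one question (hypothetical inputs).
    """
    p_true = [math.exp(ll) for ll in ll_true]
    p_false = [math.exp(ll) for ll in ll_false]
    return sum(p_true) / (sum(p_true) + sum(p_false))


# One made-up question with two true and two false reference answers.
score = mc2_score(
    ll_true=[math.log(0.3), math.log(0.1)],
    ll_false=[math.log(0.2), math.log(0.4)],
)
print(f"per-question score: {score:.2f}")
```

A task-level number would then be an average of this per-question value over all questions, so a higher score means the model shifts more probability toward truthful answers.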

Generation

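We've run some evaluations and the model seems to be doing well. But does it really generate meaningful text for the prompts you care about? Let's find out!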

For this, we'll use the generate recipe and its associated config.

Let's first copy the config to our local working directory so we can make changes.

tune cp generation ./custom_generation_config.yaml

Let's modify custom_generation_config.yaml to include the following changes.

checkpointer:
    _component_: torchtune.utils.FullModelHFCheckpointer

    # directory with the checkpoint files
    # this should match the output_dir specified during
    # finetuning
    checkpoint_dir: <checkpoint_dir>

    # checkpoint files for the fine-tuned model. This should
    # match what's shown in the logs above
    checkpoint_files: [
        hf_model_0001_0.pt,
        hf_model_0002_0.pt,
    ]

    output_dir: <checkpoint_dir>
    model_type: LLAMA2

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
    _component_: torchtune.models.llama2.llama2_tokenizer
    path: <checkpoint_dir>/tokenizer.model

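Once the config is updated, let's kick off generation! We'll use the default sampling settings, top_k=300 and temperature=0.8. These parameters control how the probabilities for sampling are computed. These are standard settings for Llama2 7B, and we recommend inspecting the model with them before experimenting with other values.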
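To illustrate how these two parameters shape the sampling distribution, here is a minimal pure-Python sketch of top-k sampling with temperature. It is not the code used by the generate recipe; the function names and the toy logits are assumptions made for illustration.

```python
import math
import random


def top_k_probs(logits, top_k, temperature):
    """Return {token_id: probability} after temperature scaling and top-k filtering."""
    # Dividing by a temperature below 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]
    k = min(top_k, len(scaled))
    # Keep only the k highest-scoring token ids.
    kept = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # Softmax over the kept ids, shifted by the max for numerical stability.
    peak = max(scaled[i] for i in kept)
    weights = {i: math.exp(scaled[i] - peak) for i in kept}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def sample_token(logits, top_k=300, temperature=0.8, rng=None):
    """Draw one token id from the filtered distribution."""
    rng = rng or random.Random()
    probs = top_k_probs(logits, top_k, temperature)
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.1, -3.0]  # made-up scores for a 4-token vocabulary
print(top_k_probs(toy_logits, top_k=2, temperature=0.8))
print(sample_token(toy_logits, rng=random.Random(0)))
```

In this sketch a smaller top_k discards more of the low-probability tail, and a lower temperature concentrates probability on the highest-scoring tokens.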

We'll use a different prompt from the one in the config.

tune run generate --config ./custom_generation_config.yaml \
prompt="What are some interesting sites to visit in the Bay Area?"

Once generation is complete, you'll see the following in the logs.

[generate.py:92] Exploratorium in San Francisco has made the cover of Time Magazine,
                 and its awesome. And the bridge is pretty cool...

[generate.py:96] Time for inference: 11.61 sec total, 25.83 tokens/sec
[generate.py:99] Memory used: 15.72 GB

Indeed, the bridge is pretty cool! It seems our LLM knows a little something about the Bay Area!

Speeding up Generation using Quantization

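We saw that the generation recipe took around 11.6 seconds to generate 300 tokens. One technique commonly used to speed up inference is quantization. torchtune provides an integration with the TorchAO quantization APIs. Let's first quantize the model using 4-bit weight-only quantization and see whether this improves generation speed.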

For this, we'll use the quantization recipe.

Let's first copy the config to our local working directory so we can make changes.

tune cp quantization ./custom_quantization_config.yaml

Let's modify custom_quantization_config.yaml to include the following changes.

checkpointer:
    _component_: torchtune.utils.FullModelHFCheckpointer

    # directory with the checkpoint files
    # this should match the output_dir specified during
    # finetuning
    checkpoint_dir: <checkpoint_dir>

    # checkpoint files for the fine-tuned model. This should
    # match what's shown in the logs above
    checkpoint_files: [
        hf_model_0001_0.pt,
        hf_model_0002_0.pt,
    ]

    output_dir: <checkpoint_dir>
    model_type: LLAMA2

Once the config is updated, let's kick off quantization! We'll use the default quantization method from the config.

tune run quantize --config ./custom_quantization_config.yaml

Once quantization is complete, you'll see the following in the logs.

[quantize.py:68] Time for quantization: 19.76 sec
[quantize.py:69] Memory used: 13.95 GB
[quantize.py:82] Model checkpoint of size 3.67 GB saved to <checkpoint_dir>/hf_model_0001_0-4w.pt

Note

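Unlike the fine-tuned checkpoints, this outputs a single checkpoint file. This is because our quantization APIs currently don't support any conversion across formats. As a result, you won't be able to use these quantized models outside of torchtune. But you should be able to use them with the generation and evaluation recipes within torchtune. These results will help inform which quantization methods you should use with your favorite inference engine.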
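For intuition about what 4-bit weight-only quantization does to the weights, here is a minimal pure-Python sketch of group-wise quantization. It is not the TorchAO implementation: the min/max (asymmetric) scheme, the function names, and the tiny group size are assumptions chosen to keep the example readable.

```python
def quantize_group(values):
    """Map one group of floats to 4-bit codes (0..15) plus a scale and offset."""
    lo, hi = min(values), max(values)
    # 4 bits give 16 levels; fall back to 1.0 for a constant group.
    scale = (hi - lo) / 15 or 1.0
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Recover approximate float values from the 4-bit codes."""
    return [c * scale + lo for c in codes]


def quantize(weights, group_size):
    """Quantize a flat list of weights group by group."""
    return [
        quantize_group(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


weights = [i / 10 for i in range(-8, 8)]   # 16 made-up weights
groups = quantize(weights, group_size=8)   # two groups of eight
restored = [x for g in groups for x in dequantize_group(*g)]
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {worst:.4f}")
```

Each weight is stored as a 4-bit code, and each group carries its own scale and offset. In this sketch a larger group size lowers that per-group overhead at the cost of coarser steps within each group.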

Now that we have the quantized model, let's re-run generation.

Modify custom_generation_config.yaml to include the following changes.

checkpointer:
    # we need to use the custom torchtune checkpointer
    # instead of the HF checkpointer for loading
    # quantized models
    _component_: torchtune.utils.FullModelTorchTuneCheckpointer

    # directory with the checkpoint files
    # this should match the output_dir specified during
    # finetuning
    checkpoint_dir: <checkpoint_dir>

    # checkpoint files point to the quantized model
    checkpoint_files: [
        hf_model_0001_0-4w.pt,
    ]

    output_dir: <checkpoint_dir>
    model_type: LLAMA2

# we also need to update the quantizer to what was used during
# quantization
quantizer:
    _component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
    groupsize: 256

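Once the config is updated, let's kick off generation! We'll use the same sampling parameters as before. We'll also use the same prompt we used with the unquantized model.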

tune run generate --config ./custom_generation_config.yaml \
prompt="What are some interesting sites to visit in the Bay Area?"

Once generation is complete, you'll see the following in the logs.

[generate.py:92] A park in San Francisco that sits at the top of a big hill.
                 There are lots of trees and a beautiful view of San Francisco...

[generate.py:96] Time for inference: 4.13 sec total, 72.62 tokens/sec
[generate.py:99] Memory used: 17.85 GB

With quantization (and torch compile under the hood), we've sped up generation by almost 3x, from 25.83 to 72.62 tokens/sec!
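If you want to compare runs without reading the numbers by hand, a small helper can pull the timing out of the log lines shown above. This is a sketch that assumes the log format matches those lines exactly; the helper name is illustrative.

```python
import re

# Matches lines such as:
# [generate.py:96] Time for inference: 11.61 sec total, 25.83 tokens/sec
INFERENCE_LINE = re.compile(
    r"Time for inference: ([\d.]+) sec total, ([\d.]+) tokens/sec"
)


def parse_inference_line(line):
    """Return (seconds, tokens_per_second) from one generate log line."""
    match = INFERENCE_LINE.search(line)
    if match is None:
        raise ValueError(f"not an inference timing line: {line!r}")
    return float(match.group(1)), float(match.group(2))


baseline = "[generate.py:96] Time for inference: 11.61 sec total, 25.83 tokens/sec"
quantized = "[generate.py:96] Time for inference: 4.13 sec total, 72.62 tokens/sec"

_, base_tps = parse_inference_line(baseline)
_, quant_tps = parse_inference_line(quantized)
speedup = quant_tps / base_tps
print(f"speedup: {speedup:.2f}x")
```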

Using torchtune checkpoints with other libraries

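As we mentioned above, one of the benefits of handling the checkpoint conversion is that you can work directly with standard formats. This helps with interoperability with other libraries, since torchtune doesn't add yet another format to the mix.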

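Let's look at an example of how this works with gpt-fast, a popular codebase for running performant inference with LLMs. This section assumes that you've cloned that repository on your machine.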

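gpt-fast makes some assumptions about the checkpoint and about the availability of a key-to-file mapping, i.e. a file that maps parameter names to the files containing them. Let's satisfy these assumptions by creating this mapping file. We'll use <new_dir>/Llama-2-7B-hf as the directory for this. gpt-fast assumes that the directory with the checkpoints has the same format as the HF repo-id.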

import json
import torch

# create the output dictionary
output_dict = {"weight_map": {}}

# Load the checkpoints
sd_1 = torch.load('<checkpoint_dir>/hf_model_0001_0.pt', mmap=True, map_location='cpu')
sd_2 = torch.load('<checkpoint_dir>/hf_model_0002_0.pt', mmap=True, map_location='cpu')

# create the weight map
for key in sd_1.keys():
    output_dict['weight_map'][key] =  "hf_model_0001_0.pt"
for key in sd_2.keys():
    output_dict['weight_map'][key] =  "hf_model_0002_0.pt"

with open('<new_dir>/Llama-2-7B-hf/pytorch_model.bin.index.json', 'w') as f:
    json.dump(output_dict, f)

Now that we've created the weight_map, let's copy over our checkpoints.

cp  <checkpoint_dir>/hf_model_0001_0.pt  <new_dir>/Llama-2-7B-hf/
cp  <checkpoint_dir>/hf_model_0002_0.pt  <new_dir>/Llama-2-7B-hf/
cp  <checkpoint_dir>/tokenizer.model     <new_dir>/Llama-2-7B-hf/
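Before converting, it can help to confirm that every file named in the weight map actually made it into the new directory. The helper below is a sketch for that check; its name is illustrative, and it only assumes the index file name and the "weight_map" key used in the script above.

```python
import json
from pathlib import Path


def missing_shards(model_dir):
    """Return checkpoint files named in the index that are absent from model_dir."""
    model_dir = Path(model_dir)
    with open(model_dir / "pytorch_model.bin.index.json") as f:
        weight_map = json.load(f)["weight_map"]
    # Many parameter names map to the same file, so deduplicate first.
    referenced = set(weight_map.values())
    return sorted(name for name in referenced if not (model_dir / name).exists())
```

Calling missing_shards on <new_dir>/Llama-2-7B-hf should return an empty list once both checkpoint files have been copied.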

Once the directory structure is set up, let's convert the checkpoints and run inference!

cd gpt-fast/

# convert the checkpoints into a format readable by gpt-fast
python scripts/convert_hf_checkpoint.py \
--checkpoint_dir <new_dir>/Llama-2-7B-hf/ \
--model 7B

# run inference using the converted model
python generate.py \
--compile \
--checkpoint_path <new_dir>/Llama-2-7B-hf/model.pth \
--device cuda

The output should look something like this:

Hello, my name is Justin. I am a middle school math teacher
at WS Middle School ...

Time for inference 5: 1.94 sec total, 103.28 tokens/sec
Bandwidth achieved: 1391.84 GB/sec

And that's it! Try your own prompt!

Uploading your model to the Hugging Face Hub

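Your new model is working great and you want to share it with the world. The easiest way to do this is with the huggingface-cli command, which works seamlessly with torchtune. Simply point the CLI to your fine-tuned model directory like so: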

huggingface-cli upload <hf-repo-id> <checkpoint-dir>

The command should output a link to your repository on the Hub. If the repository doesn't exist yet, it will be created automatically:

https://huggingface.co/<hf-repo-id>/tree/main/.

Note

Before uploading, make sure you are authenticated with Hugging Face by running huggingface-cli login.

For more details on the huggingface-cli upload feature, check out the Hugging Face docs.

Hopefully this tutorial gave you some insight into how you can use torchtune for your own workflows. Happy tuning!