Meta Llama3 in torchtune

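You will learn how to:

Download the Llama3-8B-Instruct weights and tokenizer
Fine-tune Llama3-8B-Instruct with LoRA and QLoRA
Evaluate your fine-tuned Llama3-8B-Instruct model
Generate text with your fine-tuned model
Quantize your model to speed up generation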

Prerequisites

Be familiar with torchtune
Make sure torchtune is installed

Llama3-8B

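Meta Llama 3 is a new family of models released by Meta AI that improves on the performance of the Llama2 family across a range of benchmarks. Meta Llama 3 currently comes in two sizes: 8B and 70B. This tutorial focuses on the 8B model. The main changes between the Llama2-7B and Llama3-8B models are: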

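Llama3-8B uses grouped-query attention instead of the standard multi-head attention used in Llama2-7B
Llama3-8B has a larger vocabulary (128,256 tokens, compared with 32,000 for Llama2 models)
Llama3-8B uses a different tokenizer than Llama2 models (tiktoken instead of sentencepiece)
Llama3-8B uses a larger intermediate dimension in its MLP layers than Llama2-7B
Llama3-8B uses a higher base value to compute theta in its rotary positional embeddings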
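
To make the list above more concrete, the following sketch compares a few quantities that follow from these changes. The hyperparameters are assumptions taken from the publicly released model configurations, not from this tutorial (for example, 8 key/value heads and a RoPE base of 500,000 for Llama3-8B), so check them against the model builders in your torchtune version. The `ModelShape` class and the helper functions are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Assumed architecture hyperparameters (not read from torchtune)."""
    vocab_size: int
    embed_dim: int
    num_layers: int
    num_heads: int
    num_kv_heads: int
    intermediate_dim: int
    rope_base: float

    @property
    def head_dim(self) -> int:
        return self.embed_dim // self.num_heads


LLAMA2_7B = ModelShape(32_000, 4096, 32, 32, 32, 11_008, 10_000.0)
LLAMA3_8B = ModelShape(128_256, 4096, 32, 32, 8, 14_336, 500_000.0)


def embedding_params(m: ModelShape) -> int:
    """Parameters in the token embedding table (vocab_size x embed_dim)."""
    return m.vocab_size * m.embed_dim


def kv_cache_bytes_per_token(m: ModelShape, bytes_per_value: int = 2) -> int:
    """Bytes of key + value cache per generated token (bf16 by default).

    With grouped-query attention only num_kv_heads heads are cached.
    """
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * bytes_per_value


def slowest_rope_frequency(m: ModelShape) -> float:
    """theta_i = base ** (-2i / head_dim) for the last (slowest) pair."""
    i = m.head_dim // 2 - 1
    return m.rope_base ** (-2 * i / m.head_dim)


for name, shape in (("Llama2-7B", LLAMA2_7B), ("Llama3-8B", LLAMA3_8B)):
    print(
        f"{name}: embedding params={embedding_params(shape):,}, "
        f"KV cache bytes/token={kv_cache_bytes_per_token(shape):,}, "
        f"slowest RoPE frequency={slowest_rope_frequency(shape):.3e}"
    )
```

Under these assumptions, the larger vocabulary makes the embedding table roughly four times bigger, while grouped-query attention makes the per-token key/value cache four times smaller.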

Getting access to Llama3-8B-Instruct

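For this tutorial, we will use the instruction-tuned version of Llama3-8B. First, let's download the model from Hugging Face. You will need to follow the instructions on the official Meta page to gain access to the model. Next, make sure you have your Hugging Face access token.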

tune download meta-llama/Meta-Llama-3-8B-Instruct \
    --output-dir <checkpoint_dir> \
    --hf-token <ACCESS TOKEN>
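
If you would rather not type the token on the command line, one option is to assemble the same command from a script. The sketch below is only an illustration: it assumes the token is stored in an environment variable named `HF_TOKEN`, uses a hypothetical output directory, and relies solely on the `tune download` flags shown above. The helper name `build_download_command` is made up for this example.

```python
import os
import shlex


def build_download_command(repo_id: str, output_dir: str, token: str) -> list[str]:
    """Return the argv list for the `tune download` command shown above."""
    return [
        "tune", "download", repo_id,
        "--output-dir", output_dir,
        "--hf-token", token,
    ]


if __name__ == "__main__":
    # Assumption: the token was exported beforehand, e.g. `export HF_TOKEN=...`.
    token = os.environ.get("HF_TOKEN", "<ACCESS TOKEN>")
    cmd = build_download_command(
        "meta-llama/Meta-Llama-3-8B-Instruct",
        "/tmp/Meta-Llama-3-8B-Instruct",  # hypothetical checkpoint directory
        token,
    )
    # Print the command with the token masked so it does not end up in logs.
    print(shlex.join(cmd[:-1] + ["***"]))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```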

Fine-tuning Llama3-8B-Instruct in torchtune

torchtune provides LoRA, QLoRA, and full fine-tuning recipes for fine-tuning Llama3-8B on one or more GPUs. For more on LoRA in torchtune, see our LoRA tutorial. For more on QLoRA in torchtune, see our QLoRA tutorial.

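Let's look at how to fine-tune Llama3-8B-Instruct with LoRA on a single device using torchtune. In this example, for illustration purposes, we fine-tune for one epoch on a common instruction dataset. The basic command for a single-device LoRA fine-tune is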

tune run lora_finetune_single_device --config llama3/8B_lora_single_device

Note

To see a full list of recipes and their corresponding configs, run tune ls from the command line.

We can also add command-line overrides as needed, e.g.

tune run lora_finetune_single_device --config llama3/8B_lora_single_device \
    checkpointer.checkpoint_dir=<checkpoint_dir> \
    tokenizer.path=<checkpoint_dir>/tokenizer.model \
    checkpointer.output_dir=<checkpoint_dir>

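This loads the Llama3-8B-Instruct checkpoint and tokenizer from the <checkpoint_dir> used in the tune download command above, then saves the final checkpoint to the same directory in the original format. For more details on the checkpoint formats supported in torchtune, see our checkpointing deep-dive.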
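
The `key.subkey=value` overrides above each replace one nested field of the YAML config. The following sketch illustrates that idea with plain dictionaries. It is a simplified stand-in, not torchtune's implementation (torchtune's config system builds on OmegaConf and handles typed values and more), and the function name `apply_overrides` and the paths are hypothetical.

```python
import copy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    merged = copy.deepcopy(config)
    for item in overrides:
        dotted_key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = dotted_key.split(".")
        node = merged
        for name in parents:
            # Walk (or create) the nested sections named by the dotted key.
            node = node.setdefault(name, {})
        node[leaf] = value
    return merged


# Hypothetical defaults standing in for the recipe's YAML config.
base_config = {
    "checkpointer": {"checkpoint_dir": "/tmp/default", "output_dir": "/tmp/default"},
    "tokenizer": {"path": "/tmp/default/tokenizer.model"},
}

checkpoint_dir = "/data/llama3"  # hypothetical value for <checkpoint_dir>
result = apply_overrides(
    base_config,
    [
        f"checkpointer.checkpoint_dir={checkpoint_dir}",
        f"tokenizer.path={checkpoint_dir}/tokenizer.model",
        f"checkpointer.output_dir={checkpoint_dir}",
    ],
)
print(result)
```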

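Note

To see the full set of configurable parameters for this (and other) configs, we can use tune cp to copy (and modify) the default config. tune cp can also be used with recipe scripts when you want more custom changes that cannot be made by modifying existing configurable parameters. For more on tune cp, see the section on modifying configs in our "Fine-tune Your First LLM" tutorial.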

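Once training is complete, the model checkpoints are saved and their locations are logged. For LoRA fine-tuning, the final checkpoint contains the merged weights, and a copy of just the (much smaller) LoRA weights is saved separately.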
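
As a rough illustration of what "merged weights" means, the sketch below folds a low-rank LoRA update into a base weight matrix using the standard LoRA formulation, W' = W + (alpha / r) * B A. It uses tiny hand-written matrices and plain Python lists to stay dependency-free. The function names and values are hypothetical, and this is not torchtune's merging code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A).

    weight: out_dim x in_dim, lora_a: rank x in_dim, lora_b: out_dim x rank.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x2 base weight with a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # rank x in_dim
lora_b = [[3.0], [4.0]]  # out_dim x rank
alpha, rank = 2.0, 1

merged = merge_lora(weight, lora_a, lora_b, alpha, rank)

# The merged weight gives the same output as base path + scaled adapter path.
x = [1.0, 1.0]
base_out = matvec(weight, x)
adapter_out = [(alpha / rank) * v for v in matvec(lora_b, matvec(lora_a, x))]
unmerged_out = [b + a for b, a in zip(base_out, adapter_out)]
merged_out = matvec(merged, x)
print(merged, merged_out)
```

Because the adapter only stores A and B, the separately saved LoRA weights are much smaller than the merged checkpoint.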

In our experiments, we observed a peak memory usage of 18.5 GB. The default config can be trained on a consumer GPU with 24 GB of VRAM.

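If you have multiple GPUs available, you can run the distributed version of the recipe. torchtune uses the FSDP APIs from PyTorch Distributed to shard the model, optimizer states, and gradients. This should let you increase your batch size, resulting in faster overall training. For example, on two devices: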

tune run --nproc_per_node 2 lora_finetune_distributed --config llama3/8B_lora

Finally, if we want to use even less memory, we can use torchtune's QLoRA recipe via:

tune run lora_finetune_single_device --config llama3/8B_qlora_single_device

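Since our default configs enable full bfloat16 training, all of the above commands can be run on devices with at least 24 GB of VRAM, and in fact the QLoRA recipe should have peak allocated memory below 10 GB. You can also experiment with different LoRA and QLoRA configurations, or even run a full fine-tune. Try it out!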
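
A back-of-the-envelope calculation helps explain these memory figures. The sketch below estimates the memory taken by the model weights alone, assuming roughly 8.03 billion parameters for Llama3-8B and 4 bits per base weight for QLoRA. Both numbers are assumptions, and the estimate ignores activations, LoRA parameters, optimizer state, and quantization metadata.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8.03e9  # assumption: approximate parameter count of Llama3-8B

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # bfloat16 weights (LoRA recipe)
four_bit_gb = weight_memory_gb(NUM_PARAMS, 4)  # 4-bit base weights (QLoRA)

print(f"bf16 weights:  ~{bf16_gb:.1f} GB")
print(f"4-bit weights: ~{four_bit_gb:.1f} GB")
```

Under these assumptions the bfloat16 weights account for most of the 18.5 GB peak observed with LoRA, while the 4-bit weights leave enough headroom for QLoRA to stay below 10 GB.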

Evaluating fine-tuned Llama3-8B models with EleutherAI's Eval Harness

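Now that we have fine-tuned our model, what's next? Let's take the LoRA-fine-tuned model from the previous section and look at a couple of different ways to evaluate its performance on the tasks we care about.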

First, torchtune provides an integration with EleutherAI's evaluation harness for evaluating models on common benchmark tasks.

Note

Make sure you have installed the evaluation harness via pip install "lm_eval==0.4.*".

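For this tutorial, we will use the truthfulqa_mc2 task from the harness. This task measures a model's propensity to be truthful when answering questions: it reports the model's zero-shot accuracy on a question followed by one or more true responses and one or more false responses. First, let's copy the config so that we can point the YAML file to our fine-tuned checkpoint files.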
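
For intuition, the MC2-style score for a single question can be sketched as the share of probability that the model assigns to the true responses, after normalizing over all candidate responses. The snippet below is a minimal illustration with made-up log-likelihoods. The function name `mc2_score` is hypothetical and this is not the harness's own code.

```python
import math


def mc2_score(log_likelihoods: list[float], labels: list[int]) -> float:
    """Normalized probability mass on the true answers for one question.

    log_likelihoods[i] is the model's log-likelihood of candidate answer i,
    labels[i] is 1 if that answer is true and 0 if it is false.
    """
    probs = [math.exp(ll) for ll in log_likelihoods]
    true_mass = sum(p for p, label in zip(probs, labels) if label == 1)
    return true_mass / sum(probs)


# Made-up example: one true answer and two false answers.
example_lls = [math.log(0.5), math.log(0.25), math.log(0.25)]
example_labels = [1, 0, 0]
score = mc2_score(example_lls, example_labels)
print(f"MC2-style score for this question: {score:.2f}")
```

The reported task metric is then the average of this per-question score over the dataset.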

tune cp eleuther_evaluation ./custom_eval_config.yaml

Next, we modify custom_eval_config.yaml to include the fine-tuned checkpoints.

model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.training.FullModelMetaCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files for the fine-tuned model. These will be logged
  # at the end of your fine-tune
  checkpoint_files: [
    meta_model_0.pt
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: <checkpoint_dir>/tokenizer.model
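
Before launching the evaluation, it can save time to confirm that the files named in the config actually exist. The sketch below is a small pre-flight check using only the standard library. The helper name `missing_paths` is hypothetical, and the demo creates a temporary directory in place of a real <checkpoint_dir>.

```python
import tempfile
from pathlib import Path


def missing_paths(checkpoint_dir, checkpoint_files, tokenizer_path):
    """Return the configured checkpoint/tokenizer files that are not on disk."""
    root = Path(checkpoint_dir)
    candidates = [root / name for name in checkpoint_files]
    candidates.append(Path(tokenizer_path))
    return [str(path) for path in candidates if not path.is_file()]


# Demo with a temporary directory standing in for <checkpoint_dir>.
with tempfile.TemporaryDirectory() as tmp_dir:
    # Pretend the fine-tune wrote a checkpoint but the tokenizer is missing.
    (Path(tmp_dir) / "meta_model_0.pt").write_bytes(b"")
    demo_missing = missing_paths(
        tmp_dir,
        ["meta_model_0.pt"],
        str(Path(tmp_dir) / "tokenizer.model"),
    )

print("Missing files:", demo_missing)
```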

Finally, we can run the evaluation using our modified config.

tune run eleuther_eval --config ./custom_eval_config.yaml

Try it for yourself and see what accuracy your model gets!

Generating text with our fine-tuned Llama3 model

Next, let's look at one other way to evaluate our model: generating text! torchtune provides a recipe for generation as well.

As before, let's copy and modify the default generation config.

tune cp generation ./custom_generation_config.yaml

Now we modify custom_generation_config.yaml to point to our checkpoint and tokenizer.

model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.training.FullModelMetaCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files for the fine-tuned model. These will be logged
  # at the end of your fine-tune
  checkpoint_files: [
    meta_model_0.pt
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: <checkpoint_dir>/tokenizer.model

Running generation with our LoRA-fine-tuned model, we see the following output:

tune run generate --config ./custom_generation_config.yaml \
prompt="Hello, my name is"

[generate.py:122] Hello, my name is Sarah and I am a busy working mum of two young children, living in the North East of England.
...
[generate.py:135] Time for inference: 10.88 sec total, 18.94 tokens/sec
[generate.py:138] Bandwidth achieved: 346.09 GB/s
[generate.py:139] Memory used: 18.31 GB
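
The numbers in this log can be cross-checked against each other. The sketch below parses the three summary lines shown above and derives the approximate number of generated tokens and the amount of data read per token. The regular expressions assume exactly the log format shown here, and the interpretation of the bandwidth figure is a rough consistency check rather than a statement about how the recipe computes it.

```python
import re

LOG = """\
[generate.py:135] Time for inference: 10.88 sec total, 18.94 tokens/sec
[generate.py:138] Bandwidth achieved: 346.09 GB/s
[generate.py:139] Memory used: 18.31 GB
"""


def parse_generation_log(text: str) -> dict[str, float]:
    """Extract the summary numbers from the generation log lines."""
    patterns = {
        "seconds": r"Time for inference: ([\d.]+) sec total",
        "tokens_per_sec": r"([\d.]+) tokens/sec",
        "bandwidth_gb_s": r"Bandwidth achieved: ([\d.]+) GB/s",
        "memory_gb": r"Memory used: ([\d.]+) GB",
    }
    stats = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"could not find {name!r} in log")
        stats[name] = float(match.group(1))
    return stats


stats = parse_generation_log(LOG)
tokens_generated = stats["seconds"] * stats["tokens_per_sec"]
gb_read_per_token = stats["bandwidth_gb_s"] / stats["tokens_per_sec"]

print(f"Approx. tokens generated: {tokens_generated:.0f}")
print(f"Approx. GB read per token: {gb_read_per_token:.2f}")
print(f"Memory used (GB): {stats['memory_gb']:.2f}")
```

The data read per token comes out close to the reported memory usage, which suggests that generation here is dominated by reading the model weights once per token. This is why shrinking the weights through quantization can speed up generation.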

Faster generation via quantization

We rely on torchao for post-training quantization. After installing torchao, we can quantize the fine-tuned model as follows:

# we also support `int8_weight_only()` and `int8_dynamic_activation_int8_weight()`, see
# https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques
# for a full list of techniques that we support
from torchao.quantization.quant_api import quantize_, int4_weight_only
quantize_(model, int4_weight_only())
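
To give a feel for what weight-only 4-bit quantization does, the sketch below quantizes a short list of weights group by group, with one scale per group, and then reconstructs them. It is a simplified, symmetric scheme written in plain Python for illustration only. torchao's `int4_weight_only()` uses its own quantization scheme and packed kernels, and the function names and group size here are hypothetical.

```python
def quantize_group(values, bits=4):
    """Symmetric quantization of one group to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1   # 7 for int4
    qmin = -(2 ** (bits - 1))    # -8 for int4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(qmin, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def quantize_groupwise(weights, group_size):
    """Split weights into groups and quantize each with its own scale."""
    return [
        quantize_group(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


def dequantize_groupwise(groups):
    """Reconstruct approximate weights from (codes, scale) pairs."""
    return [code * scale for codes, scale in groups for code in codes]


weights = [1.75, -0.5, 0.25, 0.0, 0.07, -0.02, 0.01, 0.035]
groups = quantize_groupwise(weights, group_size=4)
restored = dequantize_groupwise(groups)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes and scales:", groups)
print("max reconstruction error:", max_error)
```

Using a separate scale per group keeps the rounding error proportional to the largest weight in that group rather than the largest weight in the whole tensor.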

After quantization, we rely on torch.compile for the speedup. For more details, see torchao's example usage.

torchao also provides a table listing performance and accuracy results for llama2 and llama3.

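For Llama models, you can run generation directly in torchao on the quantized model using their generate.py script, as described in the torchao readme. This lets you compare your own results with those in the table mentioned above.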

This is just the beginning of what you can do with Meta Llama3 using torchtune and the broader ecosystem. We look forward to seeing what you build!