Meta Llama3 in torchtune

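You will learn how to:

Download the Llama3-8B-Instruct weights and tokenizer
Fine-tune Llama3-8B-Instruct with LoRA and QLoRA
Evaluate your fine-tuned Llama3-8B-Instruct model
Generate text with your fine-tuned model
Quantize your model to speed up generation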

Prerequisites

Be familiar with torchtune
Make sure torchtune is installed

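Llama3-8B

Meta Llama 3 is a new family of models released by Meta AI that improves on the performance of the Llama2 family. Meta Llama 3 currently comes in two sizes: 8B and 70B. This tutorial focuses on the 8B model. There are a few main changes between the Llama2-7B and Llama3-8B models:

Llama3-8B uses grouped-query attention instead of the standard multi-head attention in Llama2-7B
Llama3-8B has a larger vocabulary (128,256 tokens instead of the 32,000 in Llama2 models)
Llama3-8B uses a different tokenizer than Llama2 models (tiktoken instead of sentencepiece)
Llama3-8B uses a larger intermediate dimension in its MLP layers than Llama2-7B
Llama3-8B uses a higher base value to calculate theta in its rotary positional embeddings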
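
To make the differences listed above concrete, here is a minimal sketch that tallies parameters for both architectures. The hyperparameter values (layer count, embedding dimension, number of key/value heads, intermediate dimension, rotary base) are assumptions taken from the publicly released model configurations, not from this tutorial, and `DecoderShape` and `count_params` are hypothetical helper names, not part of torchtune.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderShape:
    """Architecture hyperparameters (assumed from public model configs)."""
    vocab_size: int
    num_layers: int
    embed_dim: int
    num_heads: int
    num_kv_heads: int      # fewer KV heads than query heads => grouped-query attention
    intermediate_dim: int  # hidden size of the gated MLP
    rope_base: int         # base used to compute rotary theta (does not affect param count)


LLAMA2_7B = DecoderShape(32_000, 32, 4096, 32, 32, 11_008, 10_000)
LLAMA3_8B = DecoderShape(128_256, 32, 4096, 32, 8, 14_336, 500_000)


def count_params(s: DecoderShape) -> int:
    """Count weights of a Llama-style decoder with untied input/output embeddings."""
    head_dim = s.embed_dim // s.num_heads
    kv_dim = s.num_kv_heads * head_dim
    # q and output projections are embed_dim x embed_dim; k and v shrink under GQA
    attn = 2 * s.embed_dim * s.embed_dim + 2 * s.embed_dim * kv_dim
    mlp = 3 * s.embed_dim * s.intermediate_dim  # gate, up, and down projections
    norms = 2 * s.embed_dim                     # two RMSNorm scales per layer
    embeddings = 2 * s.vocab_size * s.embed_dim # token embedding + output projection
    return s.num_layers * (attn + mlp + norms) + embeddings + s.embed_dim  # + final norm


print(f"Llama2-7B: {count_params(LLAMA2_7B):,}")  # 6,738,415,616
print(f"Llama3-8B: {count_params(LLAMA3_8B):,}")  # 8,030,261,248
```

Under these assumptions, grouped-query attention shrinks the attention block, while the larger vocabulary and MLP account for the net growth to roughly 8B parameters.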

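Getting access to Llama3-8B-Instruct

For this tutorial, we will use the instruction-tuned version of Llama3-8B. First, let's download the model from Hugging Face. You will need to follow the instructions on the official Meta page to gain access to the model. Next, make sure you have your Hugging Face access token at hand.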

tune download meta-llama/Meta-Llama-3-8B-Instruct \
    --output-dir <checkpoint_dir> \
    --hf-token <ACCESS TOKEN>

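Fine-tuning Llama3-8B-Instruct in torchtune

torchtune provides LoRA, QLoRA, and full fine-tuning recipes for fine-tuning Llama3-8B on one or more GPUs. For more on LoRA in torchtune, see our LoRA tutorial. For more on QLoRA in torchtune, see our QLoRA tutorial.

Let's take a look at how to fine-tune Llama3-8B-Instruct with LoRA on a single device using torchtune. In this example, we will fine-tune for one epoch on a common instruct dataset for illustrative purposes. The basic command for a single-device LoRA fine-tune is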

tune run lora_finetune_single_device --config llama3/8B_lora_single_device

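Note

To see a full list of recipes and their corresponding configs, simply run tune ls from the command line.

We can also add command-line overrides as needed, e.g.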

tune run lora_finetune_single_device --config llama3/8B_lora_single_device \
    checkpointer.checkpoint_dir=<checkpoint_dir> \
    tokenizer.path=<checkpoint_dir>/tokenizer.model \
    checkpointer.output_dir=<checkpoint_dir>
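
When the same run has to be launched repeatedly with different paths, it can help to assemble the override list programmatically. The following is only a sketch: `build_tune_command` is a hypothetical helper (not part of torchtune), and the checkpoint directory is a placeholder path. It simply reproduces the `key=value` override syntax shown above.

```python
import shlex


def build_tune_command(recipe: str, config: str, overrides: dict) -> list:
    """Assemble `tune run <recipe> --config <config> key=value ...` as an argv list."""
    argv = ["tune", "run", recipe, "--config", config]
    # Each override is passed as a single `dotted.key=value` argument.
    argv += [f"{key}={value}" for key, value in overrides.items()]
    return argv


checkpoint_dir = "/tmp/Meta-Llama-3-8B-Instruct"  # placeholder; use your own <checkpoint_dir>
argv = build_tune_command(
    "lora_finetune_single_device",
    "llama3/8B_lora_single_device",
    {
        "checkpointer.checkpoint_dir": checkpoint_dir,
        "tokenizer.path": f"{checkpoint_dir}/tokenizer.model",
        "checkpointer.output_dir": checkpoint_dir,
    },
)
print(shlex.join(argv))  # shell-safe rendering of the command
```

The resulting list could be handed to `subprocess.run(argv, check=True)` to start the fine-tune from a script.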

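This will load the Llama3-8B-Instruct checkpoint and tokenizer from the <checkpoint_dir> used in the tune download command above, then save a final checkpoint in the same directory following the original format. For more details on the checkpoint formats supported in torchtune, see our checkpointing deep-dive.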

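Note

To see the full set of configurable parameters for this (and other) configs, we can use tune cp to copy (and modify) the default config. tune cp can also be used with recipe scripts, in case you want to make more custom changes that cannot be achieved by directly modifying existing configurable parameters. For more on tune cp, see the section on modifying configs.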

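Once training is complete, the model checkpoints will be saved and their locations will be logged. For LoRA fine-tuning, the final checkpoint will contain the merged weights, and a copy of just the (much smaller) LoRA weights will be saved separately.

In our experiments, we observed a peak memory usage of 18.5 GB. The default config can be trained on a consumer GPU with 24 GB of VRAM.

If you have multiple GPUs available, you can run the distributed version of the recipe. torchtune uses the FSDP APIs from PyTorch Distributed to shard the model, optimizer states, and gradients. This should enable you to increase your batch size, resulting in faster overall training. For example, on two devices: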

tune run --nproc_per_node 2 lora_finetune_distributed --config llama3/8B_lora

Finally, if we want to use even less memory, we can leverage torchtune's QLoRA recipe via:

tune run lora_finetune_single_device --config llama3/8B_qlora_single_device

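Since our default configs enable full bfloat16 training, all of the above commands can be run on devices with at least 24 GB of VRAM; in fact, the QLoRA recipe should have peak allocated memory below 10 GB. You can also experiment with different configurations of LoRA and QLoRA, or even run a full fine-tune. Try it out!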
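
The memory figures quoted in this section can be sanity-checked with a back-of-the-envelope estimate of the weights alone. This is a rough sketch, not a profiler: the parameter count of about 8.03 billion is an assumption, and the 4-bit figure is idealized, as if every parameter were stored at exactly four bits with no quantization constants.

```python
NUM_PARAMS = 8.03e9        # assumption: approximate Llama3-8B parameter count
REPORTED_LORA_PEAK_GB = 18.5  # peak memory reported above for the LoRA recipe


def weight_gb(num_params: float, bits_per_param: float) -> float:
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_gb(NUM_PARAMS, 16)  # frozen base weights in bfloat16
four_bit_gb = weight_gb(NUM_PARAMS, 4)  # idealized 4-bit base weights

print(f"bf16 weights:  {bf16_gb:.2f} GB")
print(f"4-bit weights: {four_bit_gb:.2f} GB")
print(f"left for everything else under LoRA: {REPORTED_LORA_PEAK_GB - bf16_gb:.2f} GB")
```

On this estimate, the bfloat16 weights account for most of the 18.5 GB peak, with the remainder going to activations, LoRA parameters, gradients, and optimizer state, which is consistent with QLoRA fitting comfortably under 10 GB.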

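Evaluating fine-tuned Llama3-8B models with EleutherAI's Eval Harness

Now that we've fine-tuned our model, what's next? Let's take the LoRA-finetuned model from the preceding section and look at a couple of different ways to evaluate its performance on the tasks we care about.

First, torchtune provides an integration with EleutherAI's evaluation harness for model evaluation on common benchmark tasks.

Note

Make sure you've first installed the evaluation harness via pip install "lm_eval==0.4.*".

For this tutorial we'll use the truthfulqa_mc2 task from the harness. This task measures a model's propensity to be truthful when answering questions: it measures the model's zero-shot accuracy on a question followed by one or more true responses and one or more false responses. First, let's copy the config so we can point the YAML file at our fine-tuned checkpoint files.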
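
To give a feel for what a multiple-true-answer score of this kind looks like, here is a minimal sketch of an MC2-style computation for a single question: the probability mass assigned to the true responses, normalized over all candidate responses. This reflects our understanding of the metric rather than the harness's exact code, and `mc2_score` is a hypothetical helper name; the log-likelihood values are made up for illustration.

```python
import math


def mc2_score(true_logprobs, false_logprobs):
    """Share of probability mass on true answers among all candidate answers."""
    p_true = [math.exp(lp) for lp in true_logprobs]
    p_false = [math.exp(lp) for lp in false_logprobs]
    return sum(p_true) / (sum(p_true) + sum(p_false))


# Illustrative (made-up) log-likelihoods the model assigns to each candidate answer.
confident = mc2_score(true_logprobs=[-1.0, -2.0], false_logprobs=[-5.0])
undecided = mc2_score(true_logprobs=[-1.0], false_logprobs=[-1.0])

print(f"confident model: {confident:.3f}")
print(f"undecided model: {undecided:.3f}")
```

A task-level score would then be the average of this quantity over all questions in the benchmark.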

tune cp eleuther_evaluation ./custom_eval_config.yaml

Next, we modify custom_eval_config.yaml to include the fine-tuned checkpoints.

model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files for the fine-tuned model. These will be logged
  # at the end of your fine-tune
  checkpoint_files: [
    meta_model_0.pt
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: <checkpoint_dir>/tokenizer.model

Finally, we can run evaluation using our modified config.

tune run eleuther_eval --config ./custom_eval_config.yaml

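Try it for yourself and see what accuracy your model gets!

Generating text with our fine-tuned Llama3 model

Next, let's look at one other way to evaluate our model: generating text! torchtune provides a recipe for generation as well.

Similar to what we did for evaluation, let's copy and modify the default generation config.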

tune cp generation ./custom_generation_config.yaml

Now we modify custom_generation_config.yaml to point to our checkpoint and tokenizer.

model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files for the fine-tuned model. These will be logged
  # at the end of your fine-tune
  checkpoint_files: [
    meta_model_0.pt
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: <checkpoint_dir>/tokenizer.model

Running generation with our LoRA-finetuned model, we see the following output:

tune run generate --config ./custom_generation_config.yaml \
prompt="Hello, my name is"

[generate.py:122] Hello, my name is Sarah and I am a busy working mum of two young children, living in the North East of England.
...
[generate.py:135] Time for inference: 10.88 sec total, 18.94 tokens/sec
[generate.py:138] Bandwidth achieved: 346.09 GB/s
[generate.py:139] Memory used: 18.31 GB
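
The numbers in this log are related by simple arithmetic, which the sketch below works through. The interpretation is an assumption on our part: during token-by-token decoding, the weights are read roughly once per generated token, so bandwidth divided by throughput should approximate the model's size in memory.

```python
# Values copied from the generation log above.
total_sec = 10.88
tokens_per_sec = 18.94
bandwidth_gb_s = 346.09
memory_used_gb = 18.31

tokens_generated = total_sec * tokens_per_sec    # throughput x time
gb_read_per_token = bandwidth_gb_s / tokens_per_sec

print(f"tokens generated:   ~{tokens_generated:.0f}")
print(f"data read per token: {gb_read_per_token:.2f} GB")
print(f"memory used:         {memory_used_gb:.2f} GB")
```

The implied data read per token lands very close to the reported memory use, which suggests generation here is limited by how fast the weights can be streamed from memory. That is why shrinking the weights is a natural way to speed it up.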

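Faster generation via quantization

We can see that the model took just under 11 seconds, generating almost 19 tokens per second. We can speed this up by quantizing the model. Here we'll use 4-bit weight-only quantization provided by torchao.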
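
For readers new to the idea, the following is a generic sketch of group-wise 4-bit weight quantization in plain Python. It is not torchao's implementation (real kernels pack two 4-bit codes per byte and run on the GPU); the function names are hypothetical, and the tiny group size is chosen only to keep the example readable. The generation config later in this tutorial uses a group size of 256.

```python
def quantize_group(weights):
    """Map one group of floats to 4-bit codes (0..15) plus a scale and offset."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # 16 levels span 15 steps; guard constant groups
    codes = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale + lo for c in codes]


def quantize_groupwise(weights, groupsize):
    """Quantize consecutive groups independently so each gets its own scale."""
    return [
        quantize_group(weights[i:i + groupsize])
        for i in range(0, len(weights), groupsize)
    ]


row = [0.1, -0.4, 0.25, 0.9, -1.2, 0.0, 0.33, 0.7]
for codes, scale, lo in quantize_groupwise(row, groupsize=4):
    restored = dequantize_group(codes, scale, lo)
    print(codes, [round(x, 3) for x in restored])
```

Smaller groups track the local range of the weights more closely at the cost of storing more scales and offsets, which is the trade-off a group size setting controls.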

If you've been following along this far, you know the drill by now. Let's copy the quantization config and point it at our fine-tuned model.

tune cp quantization ./custom_quantization_config.yaml

And update custom_quantization_config.yaml with the following:

# Model arguments
model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files for the fine-tuned model. These will be logged
  # at the end of your fine-tune
  checkpoint_files: [
    meta_model_0.pt
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

To quantize the model, we can now run:

tune run quantize --config ./custom_quantization_config.yaml

[quantize.py:90] Time for quantization: 2.93 sec
[quantize.py:91] Memory used: 23.13 GB
[quantize.py:104] Model checkpoint of size 4.92 GB saved to /tmp/Llama-3-8B-Instruct-hf/consolidated-4w.pt

We can see that the model is now under 5 GB, or just over four bits for each of the 8B parameters.
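
The "just over four bits" figure follows directly from the logged checkpoint size. A small worked calculation, assuming a parameter count of about 8.03 billion (an assumption, not a number from the log) and 1 GB = 1e9 bytes:

```python
CHECKPOINT_GB = 4.92   # size reported by the quantization recipe above
NUM_PARAMS = 8.03e9    # assumption: approximate Llama3-8B parameter count

bits_per_param = CHECKPOINT_GB * 1e9 * 8 / NUM_PARAMS
print(f"{bits_per_param:.2f} bits per parameter")
```

The excess over exactly four bits is plausibly explained by per-group scales and offsets and by any tensors that are left unquantized, though the log alone does not confirm the breakdown.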

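Note

Unlike the fine-tuned checkpoints, the quantization recipe outputs a single checkpoint file. This is because our quantization APIs currently don't support any conversion across formats. As a result, you won't be able to use these quantized models outside of torchtune. However, you should be able to use them with the generation and evaluation recipes within torchtune. The results will help inform which quantization methods to use with your favorite inference engine.

Let's take our quantized model and run the same generation again. First, we'll make one more change to our custom_generation_config.yaml.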

checkpointer:
  # we need to use the custom torchtune checkpointer
  # instead of the checkpointer used above for loading
  # quantized models
  _component_: torchtune.utils.FullModelTorchTuneCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files point to the quantized model
  checkpoint_files: [
    consolidated-4w.pt,
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

# we also need to update the quantizer to what was used during
# quantization
quantizer:
  _component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
  groupsize: 256

Let's re-run generation!

tune run generate --config ./custom_generation_config.yaml \
prompt="Hello, my name is"

[generate.py:122] Hello, my name is Jake.
I am a multi-disciplined artist with a passion for creating, drawing and painting.
...
Time for inference: 1.62 sec total, 57.95 tokens/sec

By quantizing the model and running torch.compile, we get over a 3x speedup!
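
The speedup can be read off the two generation logs. The short calculation below uses only the figures reported in this tutorial; note that the two runs produced different numbers of tokens, so throughput, not wall-clock time, is the fair comparison.

```python
# Figures copied from the two generation logs in this tutorial.
baseline_sec, baseline_tok_s = 10.88, 18.94    # bf16 fine-tuned model
quantized_sec, quantized_tok_s = 1.62, 57.95   # 4-bit model

throughput_speedup = quantized_tok_s / baseline_tok_s
baseline_tokens = baseline_sec * baseline_tok_s
quantized_tokens = quantized_sec * quantized_tok_s

print(f"throughput speedup: {throughput_speedup:.2f}x")
print(f"tokens generated: ~{baseline_tokens:.0f} vs ~{quantized_tokens:.0f}")
```

Comparing the raw times (10.88 s against 1.62 s) would overstate the gain, because the quantized run stopped after fewer than half as many tokens.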

This is just the beginning of what you can do with Meta Llama3 using torchtune and the broader ecosystem. We look forward to seeing what you build!