Fine-Tuning Llama3 with Chat Data¶

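Llama3 Instruct introduced a new prompt template for fine-tuning with chat data. In this tutorial, we'll cover what you need to know to get started quickly on preparing your own custom chat dataset for fine-tuning Llama3 Instruct.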

You will learn:

How the Llama3 Instruct format differs from Llama2
All about prompt templates and special tokens
How to use your own chat dataset to fine-tune Llama3 Instruct

Prerequisites

Be familiar with configuring datasets
Know how to download Llama3 Instruct weights

Template changes from Llama2 to Llama3¶

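The Llama2 chat model requires a specific template when prompting the pre-trained model. Since the chat model was pre-trained with this prompt template, you need to use the same template at inference time to get the best performance on chat data. Otherwise, the model will just perform standard text completion, which may or may not match your intended use case.

In the official Llama2 prompt template guide, we can see that special tags are added: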

<s>[INST] <<SYS>>
You are a helpful, respectful, and honest assistant.
<</SYS>>

Hi! I am a human. [/INST] Hello there! Nice to meet you! I'm Meta AI, your friendly AI assistant </s>
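To make the layout of the tags concrete, here is a minimal sketch that rebuilds the string above from its three parts. The function name `render_llama2_prompt` is hypothetical and is not part of torchtune. Note that `<s>` and `</s>` are written as literal text only so that the result can be compared with the example; in practice they are added by the tokenizer as reserved token IDs.

```python
def render_llama2_prompt(system: str, user: str, assistant: str) -> str:
    """Render one system + user + assistant turn in the Llama2 chat layout.

    Illustration only: <s> and </s> appear as plain text here. In real use
    the tokenizer adds them as reserved token IDs.
    """
    return (
        f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
        f"{user} [/INST] {assistant} </s>"
    )


prompt = render_llama2_prompt(
    system="You are a helpful, respectful, and honest assistant.",
    user="Hi! I am a human.",
    assistant="Hello there! Nice to meet you! I'm Meta AI, your friendly AI assistant",
)
print(prompt)
```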

Llama3 Instruct overhauled the template from Llama2 to better support multi-turn conversations. The same text in the Llama3 Instruct format would look like this:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful, respectful, and honest assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Hi! I am a human.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hello there! Nice to meet you! I'm Meta AI, your friendly AI assistant<|eot_id|>
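The regularity of this layout is easier to see in code. The sketch below rebuilds the text above from a list of role/content dictionaries: every message gets a header, two newlines, its content, and `<|eot_id|>`. The helper `render_llama3_prompt` is hypothetical and only manipulates strings; it is not how the tags should be produced for training, where the tokenizer inserts them as token IDs.

```python
from typing import Dict, List


def render_llama3_prompt(messages: List[Dict[str, str]]) -> str:
    """Render messages in the Llama3 Instruct text layout (illustration only)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn: header with the role, a blank line, the content, end of turn.
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    return text


conversation = [
    {"role": "system", "content": "You are a helpful, respectful, and honest assistant."},
    {"role": "user", "content": "Hi! I am a human."},
    {
        "role": "assistant",
        "content": "Hello there! Nice to meet you! I'm Meta AI, your friendly AI assistant",
    },
]
rendered = render_llama3_prompt(conversation)
print(rendered)
```

Because every turn uses the same wrapper, adding more user and assistant turns only means appending more messages to the list.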

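The tags are entirely different, and they are also encoded differently than in Llama2. Let's walk through tokenizing an example with the Llama2 template and the Llama3 template to understand how.

Note

The Llama3 Base model uses a different prompt template than Llama3 Instruct because it has not yet been instruct tuned and the extra special tokens are untrained. If you are running inference on the Llama3 Base model without fine-tuning, we recommend using the base template for the best performance. Generally, for instruct and chat data, we recommend using Llama3 Instruct with its prompt template. The rest of this tutorial assumes you are using Llama3 Instruct.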

Tokenizing prompt templates & special tokens¶

Let's say I have a sample of a single user-assistant turn accompanied by a system prompt:

sample = [
    {
        "role": "system",
        "content": "You are a helpful, respectful, and honest assistant.",
    },
    {
        "role": "user",
        "content": "Who are the most influential hip-hop artists of all time?",
    },
    {
        "role": "assistant",
        "content": "Here is a list of some of the most influential hip-hop "
        "artists of all time: 2Pac, Rakim, N.W.A., Run-D.M.C., and Nas.",
    },
]

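Now, let's format this with the Llama2ChatTemplate class and see how it gets tokenized. Llama2ChatTemplate is an example of a prompt template, which simply structures a prompt with flavor text to indicate a certain task.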

from torchtune.data import Llama2ChatTemplate, Message

messages = [Message.from_dict(msg) for msg in sample]
formatted_messages = Llama2ChatTemplate.format(messages)
print(formatted_messages)
# [
#     Message(
#         role='user',
#         content='[INST] <<SYS>>\nYou are a helpful, respectful, and honest assistant.\n<</SYS>>\n\nWho are the most influential hip-hop artists of all time? [/INST] ',
#         ...,
#     ),
#     Message(
#         role='assistant',
#         content='Here is a list of some of the most influential hip-hop artists of all time: 2Pac, Rakim, N.W.A., Run-D.M.C., and Nas.',
#         ...,
#     ),
# ]

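There are also special tokens used by Llama2 that are not in the prompt template. If you look at our Llama2ChatTemplate class, you'll notice that we don't include the <s> and </s> tokens. These are the beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens, and they are represented differently in the tokenizer than the rest of the prompt template. Let's tokenize this example with the llama2_tokenizer() used by Llama2 to see why.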

from torchtune.models.llama2 import llama2_tokenizer

tokenizer = llama2_tokenizer("/tmp/Llama-2-7b-hf/tokenizer.model")
user_message = formatted_messages[0].text_content
tokens = tokenizer.encode(user_message, add_bos=True, add_eos=True)
print(tokens)
# [1, 518, 25580, 29962, 3532, 14816, 29903, 6778, ..., 2]

We added the BOS and EOS tokens when encoding our example text. They show up as IDs 1 and 2. We can verify that these are our BOS and EOS tokens.

print(tokenizer._spm_model.spm_model.piece_to_id("<s>"))
# 1
print(tokenizer._spm_model.spm_model.piece_to_id("</s>"))
# 2

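The BOS and EOS tokens are what we call special tokens, because they have their own reserved token IDs. This means that they index to their own individual vectors in the model's learned embedding table. The rest of the prompt template tags, [INST] and <<SYS>>, are tokenized as normal text and do not get their own IDs.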

print(tokenizer.decode(518))
# '['
print(tokenizer.decode(25580))
# 'INST'
print(tokenizer.decode(29962))
# ']'
print(tokenizer.decode([3532, 14816, 29903, 6778]))
# '<<SYS>>'

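It's important to note that you should not place the special reserved tokens in your input prompts manually, because they will be treated as normal text and not as special tokens.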

print(tokenizer.encode("<s>", add_bos=False, add_eos=False))
# [529, 29879, 29958]
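The behavior above can be reproduced with a toy encoder. This is only a sketch of the idea and not how SentencePiece splits text: it maps each character to one ID, and the only way to obtain the reserved BOS and EOS IDs is through the `add_bos` and `add_eos` flags. The names `toy_encode` and `FIRST_TEXT_ID` are hypothetical.

```python
from typing import List

BOS_ID = 1  # reserved ID, same value as printed for "<s>" above
EOS_ID = 2  # reserved ID, same value as printed for "</s>" above
FIRST_TEXT_ID = 3  # hypothetical: ordinary text IDs start after the reserved ones


def toy_encode(text: str, add_bos: bool = True, add_eos: bool = True) -> List[int]:
    """Toy character-level encoder used only to illustrate reserved IDs."""
    # Ordinary text never maps to a reserved ID, whatever characters it contains.
    ids = [FIRST_TEXT_ID + ord(ch) for ch in text]
    if add_bos:
        ids = [BOS_ID] + ids
    if add_eos:
        ids = ids + [EOS_ID]
    return ids


typed = toy_encode("<s>", add_bos=False, add_eos=False)
flagged = toy_encode("hi", add_bos=True, add_eos=True)
print(typed)    # three ordinary text IDs, none of them BOS_ID
print(flagged)  # starts with BOS_ID and ends with EOS_ID
```

Typing `<s>` into the prompt yields three ordinary IDs, while the flags yield the reserved ones, which is the same distinction the real tokenizer makes.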

Now let's take a look at Llama3's formatting to see how it is tokenized differently than Llama2.

from torchtune.models.llama3 import llama3_tokenizer

tokenizer = llama3_tokenizer("/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model")
messages = [Message.from_dict(msg) for msg in sample]
tokens, mask = tokenizer.tokenize_messages(messages)
print(tokenizer.decode(tokens))
# '<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, respectful,
# and honest assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho
# are the most influential hip-hop artists of all time?<|eot_id|><|start_header_id|>
# assistant<|end_header_id|>\n\nHere is a list of some of the most influential hip-hop
# artists of all time: 2Pac, Rakim, N.W.A., Run-D.M.C., and Nas.<|eot_id|>'

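Note

We used the tokenize_messages API for Llama3, which is different from encode. It simply manages adding all the special tokens in the correct places after encoding the individual messages.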
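As a rough mental model of that bookkeeping, the sketch below encodes each message separately and then places special token IDs around it. It is not torchtune's implementation. The IDs for `<|begin_of_text|>` and `<|eot_id|>` are the ones reported in this tutorial; the two header IDs, the character-level `toy_encode`, and the mask convention (True means the token is excluded from the loss) are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

SPECIAL = {
    "<|begin_of_text|>": 128000,    # value reported in this tutorial
    "<|start_header_id|>": 128006,  # assumed value, for illustration
    "<|end_header_id|>": 128007,    # assumed value, for illustration
    "<|eot_id|>": 128009,           # value reported in this tutorial
}


def toy_encode(text: str) -> List[int]:
    """Stand-in for the real BPE encoder: one ID per character."""
    return [ord(ch) for ch in text]


def sketch_tokenize_messages(
    messages: List[Dict[str, str]],
    encode: Callable[[str], List[int]] = toy_encode,
) -> Tuple[List[int], List[bool]]:
    """Wrap each encoded message in special token IDs and build a loss mask."""
    tokens = [SPECIAL["<|begin_of_text|>"]]
    mask = [True]
    for message in messages:
        # Assumption: only assistant tokens contribute to the loss.
        masked = message["role"] != "assistant"
        piece = (
            [SPECIAL["<|start_header_id|>"]]
            + encode(message["role"])
            + [SPECIAL["<|end_header_id|>"]]
            + encode("\n\n")
            + encode(message["content"])
            + [SPECIAL["<|eot_id|>"]]
        )
        tokens.extend(piece)
        mask.extend([masked] * len(piece))
    return tokens, mask


toy_tokens, toy_mask = sketch_tokenize_messages(
    [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"},
    ]
)
print(len(toy_tokens), toy_tokens[0], toy_tokens[-1])
```

The point of the sketch is that the special IDs are inserted as integers and never pass through the text encoder.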

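We can see that the tokenizer handled all the formatting without us specifying a prompt template. It turns out that all of the additional tags are special tokens, so we don't need a separate prompt template. We can verify this by checking that the tags are encoded as their own token IDs.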

print(tokenizer.special_tokens["<|begin_of_text|>"])
# 128000
print(tokenizer.special_tokens["<|eot_id|>"])
# 128009

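Best of all, these special tokens are handled entirely by the tokenizer, so you don't have to worry about messing up any required prompt templates.

When should I use a prompt template?¶

Whether or not to use a prompt template depends on the inference behavior you want. You should use a prompt template if you are running inference on the base model and it was pre-trained with a prompt template, or if you want to prime a fine-tuned model to expect a certain prompt structure at inference time for a specific task.

It is not strictly necessary to fine-tune with a prompt template, but specific tasks will generally require specific templates. For example, the SummarizeTemplate provides a lightweight structure to prime your fine-tuned model for prompts asking it to summarize text. The template wraps around the user message and leaves the assistant message untouched.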

f"Summarize this dialogue:\n{dialogue}\n---\nSummary:\n"

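You can fine-tune Llama2 with this template even though the model was originally pre-trained with the Llama2ChatTemplate, as long as this is what the model sees during inference. The model should be robust enough to adapt to a new template.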
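A minimal sketch of what wrapping the user message means is shown below. It operates on plain dictionaries rather than torchtune Message objects, and the helper name `apply_summarize_template` is hypothetical; only the template string itself is taken from this tutorial.

```python
from typing import Dict, List

# Template string as shown in this tutorial.
SUMMARIZE_TEMPLATE = "Summarize this dialogue:\n{dialogue}\n---\nSummary:\n"


def apply_summarize_template(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Wrap user messages in the template and leave other roles unchanged."""
    formatted = []
    for message in messages:
        content = message["content"]
        if message["role"] == "user":
            content = SUMMARIZE_TEMPLATE.format(dialogue=content)
        formatted.append({"role": message["role"], "content": content})
    return formatted


templated = apply_summarize_template(
    [
        {"role": "user", "content": "A: Lunch at noon?\nB: Sure, see you then."},
        {"role": "assistant", "content": "A and B agree to have lunch at noon."},
    ]
)
print(templated[0]["content"])
```

The same user-side wrapping has to be applied at inference time so that the model sees the structure it was fine-tuned on.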

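Fine-tuning on a custom chat dataset¶

Let's try fine-tuning Llama3 Instruct with a custom chat dataset. We'll walk through how to set up our data so that it is tokenized correctly and fed into our model.

Let's say we have a local dataset saved as a JSON file that contains conversations with an AI model. How can we get something like this into a format that Llama3 understands and tokenizes correctly?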

# data/my_data.json
[
    {
        "dialogue": [
            {
                "from": "human",
                "value": "What is your name?"
            },
            {
                "from": "gpt",
                "value": "I am an AI assistant, I don't have a name."
            },
            {
                "from": "human",
                "value": "Pretend you have a name."
            },
            {
                "from": "gpt",
                "value": "My name is Mark Zuckerberg."
            }
        ]
    }
]

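Let's first take a look at the generic dataset builders and see which one fits our use case. Since we have conversational data, chat_dataset() seems to be a good fit. For any custom local dataset, we always need to specify source, data_files, and split for any dataset builder in torchtune. For chat_dataset(), we additionally need to specify conversation_column and conversation_style. Our data follows the "sharegpt" format, so we can specify that here. Altogether, our chat_dataset() call should look like this: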

from torchtune.datasets import chat_dataset
from torchtune.models.llama3 import llama3_tokenizer

tokenizer = llama3_tokenizer("/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model")
ds = chat_dataset(
    tokenizer=tokenizer,
    source="json",
    data_files="data/my_data.json",
    split="train",
    conversation_column="dialogue",
    conversation_style="sharegpt",
)

# In config
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model

dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  data_files: data/my_data.json
  split: train
  conversation_column: dialogue
  conversation_style: sharegpt
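To make the conversation_column and conversation_style settings more concrete, here is a sketch of the kind of conversion they imply: reading the list stored under the "dialogue" column and renaming the from/value keys into the role/content form used earlier in this tutorial. The function `sharegpt_to_messages` and the `ROLE_MAP` table are hypothetical illustrations and are not torchtune's implementation.

```python
from typing import Dict, List

# Assumed mapping from "from" values in the data to chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(
    record: Dict[str, List[Dict[str, str]]],
    conversation_column: str = "dialogue",
) -> List[Dict[str, str]]:
    """Convert one from/value style record into role/content messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in record[conversation_column]
    ]


record = {
    "dialogue": [
        {"from": "human", "value": "What is your name?"},
        {"from": "gpt", "value": "I am an AI assistant, I don't have a name."},
    ]
}
converted = sharegpt_to_messages(record)
print(converted)
```

A record whose "from" value is missing from the mapping raises a KeyError here, which is a convenient way to catch unexpected speaker labels before training.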

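Note

You can pass any keyword argument for load_dataset into all of our Dataset classes and they will honor it. This is useful for common parameters, such as specifying the data split with split or the configuration with name.

If you need to add a prompt template, simply pass it into the tokenizer. Since we're fine-tuning Llama3, the tokenizer handles all the formatting for us and prompt templates are optional. Other models, such as Mistral's MistralTokenizer, use a chat template by default (MistralChatTemplate) to format all messages according to their recommendations.

Now we're ready to start fine-tuning! We'll use the built-in LoRA single device recipe. Use the tune cp command to get a copy of the 8B_lora_single_device.yaml config and update it with your dataset configuration.

Launch the fine-tune!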

$ tune run lora_finetune_single_device --config custom_8B_lora_single_device.yaml epochs=15