Chat Datasets¶
Chat datasets involve multi-turn conversations (multiple back-and-forths) between a user and an assistant.
[
    {"role": "user", "content": "What is the answer to the ultimate question of life?"},
    {"role": "assistant", "content": "The answer is 42."},
    {"role": "user", "content": "That's ridiculous"},
    {"role": "assistant", "content": "Oh I know."},
]
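This is more structured than the free-form text that models are typically pre-trained on, where they learn only to predict the next token rather than to respond accurately to a user.
The primary entry point for fine-tuning with chat datasets in torchtune is the chat_dataset() builder. It lets you specify a local or Hugging Face dataset that follows the chat data format directly from the config and train your LLM on it.
Example chat dataset¶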
# data/my_data.json
[
    {
        "conversations": [
            {
                "from": "human",
                "value": "What is the answer to life?"
            },
            {
                "from": "gpt",
                "value": "The answer is 42."
            },
            {
                "from": "human",
                "value": "That's ridiculous"
            },
            {
                "from": "gpt",
                "value": "Oh I know."
            }
        ]
    }
]
from torchtune.models.mistral import mistral_tokenizer
from torchtune.datasets import chat_dataset

m_tokenizer = mistral_tokenizer(
    path="/tmp/Mistral-7B-v0.1/tokenizer.model",
    prompt_template="torchtune.models.mistral.MistralChatTemplate",
    max_seq_len=8192,
)
ds = chat_dataset(
    tokenizer=m_tokenizer,
    source="json",
    data_files="data/my_data.json",
    split="train",
    conversation_column="conversations",
    conversation_style="sharegpt",
    # By default, user prompt is ignored in loss. Set to True to include it
    train_on_input=True,
    new_system_prompt=None,
)
tokenized_dict = ds[0]
tokens, labels = tokenized_dict["tokens"], tokenized_dict["labels"]
print(m_tokenizer.decode(tokens))
# [INST] What is the answer to life? [/INST] The answer is 42. [INST] That's ridiculous [/INST] Oh I know.
print(labels)
# [1, 733, 16289, 28793, 1824, 349, 272, 4372, ...]
# In config
tokenizer:
  _component_: torchtune.models.mistral.mistral_tokenizer
  path: /tmp/Mistral-7B-v0.1/tokenizer.model
  prompt_template: torchtune.models.mistral.MistralChatTemplate
  max_seq_len: 8192

dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  data_files: data/my_data.json
  split: train
  conversation_column: conversations
  conversation_style: sharegpt
  train_on_input: True
  new_system_prompt: null
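To make the effect of train_on_input more concrete, here is a minimal sketch in plain Python. It is not torchtune's implementation: it assumes each message has already been tokenized, and that positions excluded from the loss are marked with an ignore index of -100 (the default ignore_index of PyTorch's cross-entropy loss). The helper build_labels and the toy token IDs are hypothetical.

```python
# A toy illustration of how train_on_input could affect labels.
# IGNORE_IDX, build_labels, and the token IDs are assumptions made for
# illustration; they are not taken from torchtune's source.
from typing import Dict, List, Tuple

IGNORE_IDX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(
    tokenized_messages: List[Dict],
    train_on_input: bool,
) -> Tuple[List[int], List[int]]:
    """Concatenate per-message token IDs and derive the label sequence."""
    tokens: List[int] = []
    labels: List[int] = []
    for message in tokenized_messages:
        ids = message["token_ids"]
        tokens.extend(ids)
        # Assistant tokens always contribute to the loss; other roles
        # contribute only when train_on_input is True.
        if train_on_input or message["role"] == "assistant":
            labels.extend(ids)
        else:
            labels.extend([IGNORE_IDX] * len(ids))
    return tokens, labels


toy_conversation = [
    {"role": "user", "token_ids": [11, 12, 13]},
    {"role": "assistant", "token_ids": [21, 22]},
]

tokens, masked = build_labels(toy_conversation, train_on_input=False)
_, unmasked = build_labels(toy_conversation, train_on_input=True)
print(masked)    # [-100, -100, -100, 21, 22]
print(unmasked)  # [11, 12, 13, 21, 22]
```

With train_on_input=True the labels equal the tokens, which is consistent with the printed labels in the example above starting with ordinary token IDs.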
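Chat dataset format¶
Chat datasets typically have a single column named "conversations" or "messages", where each sample holds a list of messages on a single topic. The list of messages can include a system prompt, multiple turns between the user and the assistant, and tool calls with their returned results.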
| conversations |
|--------------------------------------------------------------|
| [{"role": "user", "content": "What day is today?"}, |
| {"role": "assistant", "content": "It is Tuesday."}] |
| [{"role": "user", "content": "What about tomorrow?"}, |
| {"role": "assistant", "content": "Tomorrow is Wednesday."}] |
As an example, you can look at the schema of the SlimOrca dataset.
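Before training, it can be useful to confirm that each row really has this shape. The following is a small sketch using only the standard library; the function name check_row and its parameters are hypothetical and not part of torchtune. For a ShareGPT-style dataset, required_keys would be ("from", "value") instead.

```python
from typing import Any, Dict, Sequence


def check_row(
    row: Dict[str, Any],
    column: str = "conversations",
    required_keys: Sequence[str] = ("role", "content"),
) -> int:
    """Return the number of messages in the row, or raise ValueError."""
    if column not in row:
        raise ValueError(f"missing column {column!r}")
    messages = row[column]
    if not isinstance(messages, list) or not messages:
        raise ValueError(f"column {column!r} must be a non-empty list")
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            raise ValueError(f"message {i} is not a mapping")
        missing = [key for key in required_keys if key not in message]
        if missing:
            raise ValueError(f"message {i} is missing keys {missing}")
    return len(messages)


row = {
    "conversations": [
        {"role": "user", "content": "What day is today?"},
        {"role": "assistant", "content": "It is Tuesday."},
    ]
}
print(check_row(row))  # 2
```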
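Loading chat datasets from Hugging Face¶
You need to pass the dataset repo name to source, select one of the conversation styles in conversation_style, and specify the conversation_column.
For most HF datasets, you will also need to specify the split.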
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import chat_dataset

g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="Open-Orca/SlimOrca-Dedup",
    conversation_column="conversations",
    conversation_style="sharegpt",
    split="train",
)

# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: Open-Orca/SlimOrca-Dedup
  conversation_column: conversations
  conversation_style: sharegpt
  split: train
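Loading local and remote chat datasets¶
To load a local dataset, or a remote one over HTTPS, that contains conversational data, you additionally need to specify the data_files and split arguments. See Hugging Face's load_dataset documentation for more details on loading local or remote files.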
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import chat_dataset

g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="json",
    conversation_column="conversations",
    conversation_style="sharegpt",
    data_files="data/my_data.json",
    split="train",
)

# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  conversation_column: conversations
  conversation_style: sharegpt
  data_files: data/my_data.json
  split: train
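For quick experiments, a file such as data/my_data.json can be produced with Python's standard json module. This is only a sketch: the temporary directory stands in for a project folder, and the variable names are chosen for illustration. The resulting path is what would be passed as data_files.

```python
import json
import tempfile
from pathlib import Path

# One row with a ShareGPT-style "conversations" column.
rows = [
    {
        "conversations": [
            {"from": "human", "value": "What is the answer to life?"},
            {"from": "gpt", "value": "The answer is 42."},
        ]
    }
]

# A temporary directory stands in for the project folder here.
data_dir = Path(tempfile.mkdtemp()) / "data"
data_dir.mkdir(parents=True)
data_path = data_dir / "my_data.json"
data_path.write_text(json.dumps(rows, indent=2), encoding="utf-8")

# Read the file back to confirm it is a JSON array of rows.
loaded = json.loads(data_path.read_text(encoding="utf-8"))
print(len(loaded), list(loaded[0].keys()))  # 1 ['conversations']
```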
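Specifying conversation styles¶
The structure of the conversations in the raw dataset can vary widely, with different role names and different field names for the message content. However, a few standardized formats are common across many datasets. torchtune has built-in converters that turn these standardized formats into a list of torchtune messages that follow this format: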
[
    {
        "role": "system" | "user" | "assistant" | "ipython",
        "content": <message>,
    },
    ...
]
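As a rough illustration of what such a conversion involves, the sketch below maps a ShareGPT-style row (with "from" and "value" fields, as in the data/my_data.json example) onto role/content dictionaries. It uses plain dictionaries rather than torchtune's own classes, and the names ROLE_MAP and sharegpt_to_messages are hypothetical; the role mapping itself is an assumption based on the usual ShareGPT speaker names.

```python
from typing import Dict, List

# Assumed mapping from ShareGPT speaker names to the roles listed above.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(
    sample: Dict, column: str = "conversations"
) -> List[Dict[str, str]]:
    """Convert one ShareGPT-style row into a list of role/content dicts."""
    messages = []
    for turn in sample[column]:
        messages.append(
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        )
    return messages


sample = {
    "conversations": [
        {"from": "human", "value": "What is the answer to life?"},
        {"from": "gpt", "value": "The answer is 42."},
    ]
}
converted = sharegpt_to_messages(sample)
print(converted[0])  # {'role': 'user', 'content': 'What is the answer to life?'}
```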
"openai"¶
The associated message transform is OpenAIToMessages. The expected format is:
{
    "messages": [
        {
            "role": "system" | "user" | "assistant",
            "content": <message>,
        },
        ...
    ]
}
You can specify conversation_style=openai in code or config:
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import chat_dataset

g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="json",
    conversation_column="conversations",
    conversation_style="openai",
    data_files="data/my_data.json",
    split="train",
)

# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  conversation_column: conversations
  conversation_style: openai
  data_files: data/my_data.json
  split: train
If your dataset does not fit any of the above conversation styles, you will need to create a custom message transform.
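The general shape of such a transform can be sketched without torchtune installed. In the example below, SimpleMessage is a hypothetical stand-in for a torchtune message object, and the "speaker"/"text" schema, the "dialogue" column, and the class name SpeakerTextToMessages are all invented for illustration. The sketch assumes a transform is a callable that takes one raw sample and returns a dictionary with a "messages" list; wiring it into a dataset is left out.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Mapping


@dataclass
class SimpleMessage:
    """Hypothetical stand-in for a torchtune message object."""

    role: str
    content: str
    masked: bool = False  # True means: exclude from the loss


class SpeakerTextToMessages:
    """Convert rows of {"speaker", "text"} turns into role/content messages.

    The "speaker"/"text" schema and the default column name "dialogue"
    are invented for this example.
    """

    def __init__(self, train_on_input: bool = False, column: str = "dialogue"):
        self.train_on_input = train_on_input
        self.column = column
        self.role_map = {"customer": "user", "agent": "assistant"}

    def __call__(self, sample: Mapping[str, Any]) -> Dict[str, List[SimpleMessage]]:
        messages = []
        for turn in sample[self.column]:
            role = self.role_map[turn["speaker"]]
            # Mask non-assistant turns unless training on the input too.
            masked = role != "assistant" and not self.train_on_input
            messages.append(
                SimpleMessage(role=role, content=turn["text"], masked=masked)
            )
        return {"messages": messages}


transform = SpeakerTextToMessages()
result = transform(
    {
        "dialogue": [
            {"speaker": "customer", "text": "What is the answer to life?"},
            {"speaker": "agent", "text": "The answer is 42."},
        ]
    }
)
print([(m.role, m.masked) for m in result["messages"]])
# [('user', True), ('assistant', False)]
```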
Renaming columns¶
To specify the column that contains your conversation data, use conversation_column.
# data/my_data.json
[
    {
        "dialogue": [
            {
                "from": "human",
                "value": "What is the answer to life?"
            },
            {
                "from": "gpt",
                "value": "The answer is 42."
            },
            {
                "from": "human",
                "value": "That's ridiculous"
            },
            {
                "from": "gpt",
                "value": "Oh I know."
            }
        ]
    }
]
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import chat_dataset

g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="json",
    conversation_column="dialogue",
    conversation_style="sharegpt",
    data_files="data/my_data.json",
    split="train",
)

# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  conversation_column: dialogue
  conversation_style: sharegpt
  data_files: data/my_data.json
  split: train
Chat templates¶
Chat templates are defined the same way as instruct templates in instruct_dataset(). See Instruct templates for more information.
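To give a feel for what a chat template does, the sketch below wraps user turns in [INST] ... [/INST] tags, modeled on the decoded Mistral output shown in the first example on this page. It is a simplified illustration, not the MistralChatTemplate implementation: the TEMPLATE dictionary and apply_template function are hypothetical, and joining turns with a single space is an assumption, since the real tokenizer handles spacing and special tokens itself.

```python
from typing import Dict, List, Tuple

# Assumed tag layout: text to prepend and append per role.
# Roles without an entry are passed through unchanged.
TEMPLATE: Dict[str, Tuple[str, str]] = {
    "user": ("[INST] ", " [/INST]"),
}


def apply_template(messages: List[Dict[str, str]]) -> str:
    """Render a conversation as one string using the tags in TEMPLATE."""
    rendered = []
    for message in messages:
        tags = TEMPLATE.get(message["role"])
        if tags is None:
            rendered.append(message["content"])
        else:
            prepend, append = tags
            rendered.append(f"{prepend}{message['content']}{append}")
    return " ".join(rendered)


conversation = [
    {"role": "user", "content": "What is the answer to life?"},
    {"role": "assistant", "content": "The answer is 42."},
    {"role": "user", "content": "That's ridiculous"},
    {"role": "assistant", "content": "Oh I know."},
]
prompt = apply_template(conversation)
print(prompt)
# [INST] What is the answer to life? [/INST] The answer is 42. [INST] That's ridiculous [/INST] Oh I know.
```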