Text completion datasets¶
Text completion datasets are typically used for continued pre-training, a paradigm in which a base model is fine-tuned on unstructured, unlabeled datasets in a self-supervised manner.
The primary entry point for fine-tuning with text completion datasets in torchtune is the text_completion_dataset() builder.
A text completion dataset only needs a single column containing the text of each sample, named "text" by default; a different column can be selected via the column argument, as in the examples below.
Example local text completion datasets¶
.json format¶
# odyssey.json
[
    {
        "input": "After we were clear of the river Oceanus, and had got out into the open sea, we went on till we reached the Aeaean island where there is dawn and sunrise as in other places. We then drew our ship on to the sands and got out of her on to the shore, where we went to sleep and waited till day should break."
    },
    {
        "input": "Then, when the child of morning, rosy-fingered Dawn, appeared, I sent some men to Circe's house to fetch the body of Elpenor. We cut firewood from a wood where the headland jutted out into the sea, and after we had wept over him and lamented him we performed his funeral rites. When his body and armour had been burned to ashes, we raised a cairn, set a stone over it, and at the top of the cairn we fixed the oar that he had been used to row with."
    }
]
from torchtune.models.llama3 import llama3_tokenizer
from torchtune.datasets import text_completion_dataset
m_tokenizer = llama3_tokenizer(
    path="/tmp/Meta-Llama-3.1-8B/original/tokenizer.model",
    max_seq_len=8192,
)
ds = text_completion_dataset(
    tokenizer=m_tokenizer,
    source="json",
    column="input",
    data_files="odyssey.json",
    split="train",
)
tokenized_dict = ds[0]
print(m_tokenizer.decode(tokenized_dict["tokens"]))
# After we were clear of the river Oceanus, and had got out into the open sea,\
# we went on till we reached the Aeaean island where there is dawn and sunrise \
# as in other places. We then drew our ship on to the sands and got out of her on \
# to the shore, where we went to sleep and waited till day should break.
print(tokenized_dict["labels"])
# [128000, 6153, 584, 1051, 2867, 315, 279, 15140, 22302, 355, 11, 323, 1047, \
# 2751, 704, 1139, 279, 1825, 9581, 11, 584, 4024, 389, 12222, 584, 8813, 279, \
# 362, 12791, 5420, 13218, 1405, 1070, 374, 39493, 323, 64919, 439, 304, 1023, \
# 7634, 13, 1226, 1243, 24465, 1057, 8448, 389, 311, 279, 70163, 323, 2751, 704, \
# 315, 1077, 389, 311, 279, 31284, 11, 1405, 584, 4024, 311, 6212, 323, 30315, \
# 12222, 1938, 1288, 1464, 13, 128001]
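Because continued pre-training is self-supervised, the labels are simply a copy of the input tokens; the shift for next-token prediction happens later, at loss computation. A quick sanity check on the sample built above (a sketch, not part of the original example):
# For text completion, the labels mirror the tokens one-to-one
assert tokenized_dict["labels"] == tokenized_dict["tokens"]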
This can also be accomplished via a yaml config:
# In config
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3.1-8B/original/tokenizer.model
  max_seq_len: 8192

dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: json
  data_files: odyssey.json
  column: input
  split: train
.txt format¶
# odyssey.txt
After we were clear of the river Oceanus, and had got out into the open sea, we went on till we reached the Aeaean island where there is dawn and sunrise as in other places. We then drew our ship on to the sands and got out of her on to the shore, where we went to sleep and waited till day should break.
Then, when the child of morning, rosy-fingered Dawn, appeared, I sent some men to Circe's house to fetch the body of Elpenor. We cut firewood from a wood where the headland jutted out into the sea, and after we had wept over him and lamented him we performed his funeral rites. When his body and armour had been burned to ashes, we raised a cairn, set a stone over it, and at the top of the cairn we fixed the oar that he had been used to row with.
from torchtune.models.llama3 import llama3_tokenizer
from torchtune.datasets import text_completion_dataset
m_tokenizer = llama3_tokenizer(
    path="/tmp/Meta-Llama-3.1-8B/original/tokenizer.model",
    max_seq_len=8192,
)
ds = text_completion_dataset(
    tokenizer=m_tokenizer,
    source="text",
    data_files="odyssey.txt",
    split="train",
)
# the outputs here are identical to above
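Note that the Hugging Face "text" loader yields one sample per line of the file by default (an assumption about the underlying datasets library), so odyssey.txt produces the same two samples as the .json version:
# Two samples, one per line of odyssey.txt (assumes default line-based splitting)
print(len(ds))  # 2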
Again, this can also be accomplished via a yaml config:
# In config
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3.1-8B/original/tokenizer.model
  max_seq_len: 8192

dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: text
  data_files: odyssey.txt
  split: train
Loading text completion datasets from Hugging Face¶
To load a text completion dataset from Hugging Face, pass the dataset repo name to source. For most HF datasets, you will also need to specify the split.
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import text_completion_dataset
g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = text_completion_dataset(
    tokenizer=g_tokenizer,
    source="wikimedia/wikipedia",
    split="train",
)
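Any additional keyword arguments are forwarded to the underlying Hugging Face load_dataset() call. Datasets with multiple configurations, such as wikimedia/wikipedia, usually require one to be selected; a sketch, where the name value "20231101.en" is an assumed config, not part of the original example:
ds = text_completion_dataset(
    tokenizer=g_tokenizer,
    source="wikimedia/wikipedia",
    split="train",
    # Forwarded to load_dataset(); "20231101.en" is an assumed config name
    name="20231101.en",
)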
# Tokenizer is passed into the dataset in the recipe so we don't need it here
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: wikimedia/wikipedia
  split: train
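For higher training throughput, text completion samples can also be packed into fixed-length sequences via the dataset's packed flag (packing requires max_seq_len to be set on the tokenizer); a sketch extending the config above:
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: wikimedia/wikipedia
  split: train
  packed: True  # pack samples up to the tokenizer's max_seq_len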