torchtune.datasets

For a detailed general usage guide, please see our datasets tutorial.

Example datasets

torchtune supports several widely used datasets to help quickly bootstrap your fine-tuning.

`alpaca_dataset`	Support for the family of Alpaca-style datasets from Hugging Face Datasets, using the data input format and prompt template from the original Alpaca codebase, where `instruction`, `input`, and `output` are fields from the dataset.
`alpaca_cleaned_dataset`	Builder for a variant of Alpaca-style datasets that uses yahma/alpaca-cleaned, the cleaned version of the original Alpaca dataset.
`grammar_dataset`	Support for grammar correction datasets and their variants from Hugging Face Datasets.
`samsum_dataset`	Support for summarization datasets and their variants from Hugging Face Datasets.
`slimorca_dataset`	Support for the SlimOrca-style family of conversational datasets.
`stack_exchanged_paired_dataset`	Family of preference datasets similar to StackExchangePaired data.
`cnn_dailymail_articles_dataset`	Support for the family of datasets similar to CNN / DailyMail, a corpus of news articles.
`wikitext_dataset`	Support for the family of datasets similar to wikitext, an unstructured text corpus consisting of articles from Wikipedia.
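
To make the `alpaca_dataset` entry more concrete, here is a minimal sketch of how an Alpaca-style sample with `instruction`, `input`, and `output` fields might be turned into a prompt. It does not import torchtune; the names `format_alpaca_prompt` and `build_training_text` are hypothetical helpers, and the template text is assumed to follow the wording used in the original Alpaca codebase.

```python
from typing import Dict

# Template wording assumed to follow the original Alpaca codebase.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def format_alpaca_prompt(sample: Dict[str, str]) -> str:
    """Hypothetical helper: pick a template based on whether `input` is non-empty."""
    if sample.get("input"):
        return PROMPT_WITH_INPUT.format(
            instruction=sample["instruction"], input=sample["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=sample["instruction"])


def build_training_text(sample: Dict[str, str]) -> str:
    """Hypothetical helper: the prompt followed by the expected `output`."""
    return format_alpaca_prompt(sample) + sample["output"]


sample = {
    "instruction": "Translate the sentence to French.",
    "input": "Good morning.",
    "output": "Bonjour.",
}
print(build_training_text(sample))
```

In a real fine-tuning run the resulting text would still need to be tokenized, and whether the prompt tokens contribute to the loss is a separate choice that this sketch does not model.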

Generic dataset builders

torchtune also supports generic dataset builders for common formats such as chat and instruct data. These are especially useful when specifying a dataset from a YAML config.

`instruct_dataset`	Build a configurable dataset with instruction prompts.
`chat_dataset`	Build a configurable dataset with conversations.
`text_completion_dataset`	Build a configurable dataset from a freeform, unstructured text corpus similar to datasets used in pre-training.
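
As a rough illustration of why builder functions pair well with config files, the sketch below resolves a dotted path to a callable and invokes it with the remaining keys as keyword arguments. This is an assumption-laden sketch, not torchtune's config system: the `instantiate` helper is hypothetical, the `_component_` key is assumed to name the builder, and a standard-library class stands in for a dataset builder so the example runs without torchtune installed.

```python
import importlib
from typing import Any, Dict


def instantiate(config: Dict[str, Any]) -> Any:
    """Hypothetical helper: call the object named by `_component_` with the other keys."""
    kwargs = dict(config)  # copy so the caller's config is left untouched
    dotted_path = kwargs.pop("_component_")
    module_name, _, attr_name = dotted_path.rpartition(".")
    builder = getattr(importlib.import_module(module_name), attr_name)
    return builder(**kwargs)


# A parsed YAML dataset entry would be a plain dict like this one. In a real
# config the path might point at a builder such as
# torchtune.datasets.instruct_dataset (assumption); a stdlib class is used
# here purely as a stand-in.
config = {"_component_": "fractions.Fraction", "numerator": 3, "denominator": 4}

result = instantiate(config)
print(type(result).__name__, result)
```

The design point is that a builder taking only simple keyword arguments can be described entirely by strings and numbers, which is what a YAML file can express.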

Generic dataset classes

Class representations for the above dataset builders.

`InstructDataset`	Class that supports any custom dataset with instruction-based prompts and a configurable template.
`ChatDataset`	Class that supports any custom dataset with multiturn conversations.
`TextCompletionDataset`	Freeform dataset for any unstructured text corpus.
`ConcatDataset`	A dataset class for concatenating multiple sub-datasets into a single dataset.
`PackedDataset`	Performs greedy sample packing on a provided dataset.
`PreferenceDataset`	Class that supports any custom preference dataset with instruction-based prompts, paired chosen and rejected responses, and a configurable template.
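
The ideas behind `ConcatDataset` and `PackedDataset` can be sketched in plain Python. The code below is a simplified illustration, not the library's implementation: `ConcatView` and `greedy_pack` are hypothetical names, samples are assumed to be lists of token ids, and this packing variant truncates oversized samples and never splits a sample across packs.

```python
import bisect
from itertools import accumulate
from typing import List, Sequence


class ConcatView:
    """Hypothetical sketch: expose several sub-datasets through one index space."""

    def __init__(self, datasets: Sequence[Sequence]):
        self._datasets = list(datasets)
        # Cumulative lengths, e.g. sizes [2, 3] give end offsets [2, 5].
        self._ends = list(accumulate(len(d) for d in self._datasets))

    def __len__(self) -> int:
        return self._ends[-1] if self._ends else 0

    def __getitem__(self, index: int):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # First sub-dataset whose end offset is strictly greater than index.
        ds_idx = bisect.bisect_right(self._ends, index)
        start = self._ends[ds_idx - 1] if ds_idx > 0 else 0
        return self._datasets[ds_idx][index - start]


def greedy_pack(samples: Sequence[List[int]], max_seq_len: int) -> List[List[int]]:
    """Hypothetical sketch: fill each pack in order until the next sample would not fit."""
    packs: List[List[int]] = []
    current: List[int] = []
    for tokens in samples:
        tokens = tokens[:max_seq_len]  # simplification: truncate oversized samples
        if current and len(current) + len(tokens) > max_seq_len:
            packs.append(current)
            current = []
        current = current + tokens
    if current:
        packs.append(current)
    return packs


combined = ConcatView([[[1, 2, 3], [4, 5]], [[6, 7, 8, 9], [10]]])
samples = [combined[i] for i in range(len(combined))]
print(greedy_pack(samples, max_seq_len=5))
```

Packing reduces the amount of padding per batch, at the cost of placing unrelated samples in the same sequence; a production implementation would also need to track sample boundaries so attention and position ids can respect them.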
