torchtune.datasets

For a detailed general usage guide, please see our datasets tutorial.

Example datasets

torchtune supports several widely used datasets to help you quickly bootstrap fine-tuning.

alpaca_dataset

Support for the family of Alpaca-style datasets from Hugging Face Datasets, using the data input format and prompt template from the original Alpaca codebase, where instruction, input, and output are fields from the dataset.
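To make the prompt template concrete, here is a minimal sketch of how an Alpaca-style sample could be rendered into a prompt. The template strings follow the original Alpaca codebase; the helper name `format_alpaca_prompt` and the sample contents are illustrative assumptions, not part of torchtune's API.

```python
# Prompt template from the original Alpaca codebase. There are two variants,
# chosen by whether the sample has a non-empty "input" field.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def format_alpaca_prompt(sample: dict) -> str:
    """Hypothetical helper: render one Alpaca-style sample as a prompt string."""
    if sample.get("input"):
        return PROMPT_WITH_INPUT.format(
            instruction=sample["instruction"], input=sample["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=sample["instruction"])


# Made-up sample using the three dataset fields named above.
sample = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
prompt = format_alpaca_prompt(sample)
prompt_no_input = format_alpaca_prompt({"instruction": "Tell a joke.", "input": ""})
```

The `output` field is not part of the prompt; it is the target text the model is trained to produce after `### Response:`.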

alpaca_cleaned_dataset

Builder for the Alpaca-style dataset variant that uses yahma/alpaca-cleaned, a cleaned version of the original Alpaca dataset.

grammar_dataset

Support for grammar correction datasets and their variants from Hugging Face Datasets.

samsum_dataset

Support for summarization datasets and their variants from Hugging Face Datasets.

slimorca_dataset

Support for the SlimOrca-style family of conversational datasets.

stack_exchanged_paired_dataset

Support for the family of preference datasets similar to StackExchangePaired.

cnn_dailymail_articles_dataset

Support for the family of datasets similar to CNN / DailyMail, a corpus of news articles.

wikitext_dataset

Support for the family of datasets similar to wikitext, an unstructured text corpus consisting of articles from Wikipedia.

Generic dataset builders

torchtune also provides generic dataset builders for common formats, such as those used by chat and instruct models. These are especially useful when specifying a dataset from a YAML config.

instruct_dataset

Build a configurable dataset with instruction prompts.

chat_dataset

Build a configurable dataset with conversations.

text_completion_dataset

Build a configurable dataset from a freeform, unstructured text corpus similar to datasets used in pre-training.
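The builders above are plain functions, which is what makes them easy to name from a config. As a rough sketch of the idea, the snippet below resolves a dotted path stored under a `_component_` key and calls it with the remaining keys as keyword arguments. The `instantiate` helper here is a hypothetical stand-in written for illustration, not torchtune's own implementation, and the target is a standard-library class so the example runs without torchtune installed.

```python
import importlib


def instantiate(config: dict, **extra):
    """Hypothetical helper: build an object from a mapping such as
    {"_component_": "package.module.name", ...keyword arguments...}."""
    kwargs = {k: v for k, v in config.items() if k != "_component_"}
    module_path, _, name = config["_component_"].rpartition(".")
    builder = getattr(importlib.import_module(module_path), name)
    return builder(**kwargs, **extra)


# Stand-in target from the standard library. In a real config the path would
# name a dataset builder, and the other keys would be that builder's arguments.
config = {"_component_": "fractions.Fraction", "numerator": 3, "denominator": 4}
obj = instantiate(config)
```

Arguments that cannot be written in a config file, such as a tokenizer object, can be supplied at call time through `extra` in this sketch.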

Generic dataset classes

Class representations for the above dataset builders.

InstructDataset

Class that supports any custom dataset with instruction-based prompts and a configurable template.

ChatDataset

Class that supports any custom dataset with multi-turn conversations.

TextCompletionDataset

Freeform dataset for any unstructured text corpus.

ConcatDataset

A dataset class for concatenating multiple sub-datasets into a single dataset.
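The idea of concatenation can be sketched in a few lines: keep the cumulative lengths of the sub-datasets and map a global index to a (sub-dataset, local index) pair. The class below, `SimpleConcat`, is a hypothetical stand-in that works on plain Python lists; it illustrates the technique and is not the ConcatDataset implementation.

```python
from bisect import bisect_right
from itertools import accumulate


class SimpleConcat:
    """Hypothetical stand-in: present several indexable datasets as one."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        # Cumulative end offsets, e.g. lengths [3, 2] give [3, 5].
        self.offsets = list(accumulate(len(d) for d in self.datasets))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Find the first sub-dataset whose end offset exceeds the index.
        ds_idx = bisect_right(self.offsets, index)
        start = self.offsets[ds_idx - 1] if ds_idx > 0 else 0
        return self.datasets[ds_idx][index - start]


combined = SimpleConcat([["a0", "a1", "a2"], ["b0", "b1"]])
items = [combined[i] for i in range(len(combined))]
```

Because lookups go through the offsets, the sub-datasets are never copied or merged; each one only needs to support `len()` and integer indexing.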

PackedDataset

Performs greedy sample packing on a provided dataset.
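Greedy sample packing can be illustrated as follows: walk through the tokenized samples in order and append each one to the current pack until the next sample would exceed the maximum sequence length, then start a new pack. The function `pack_greedy` and its truncation rule are assumptions made for this sketch; an actual implementation may instead split samples across packs, pad each pack, or record sample boundaries for attention masking.

```python
def pack_greedy(samples, max_seq_len):
    """Hypothetical helper: concatenate token lists into packs of at most
    max_seq_len tokens, preserving the original sample order."""
    packs, current = [], []
    for tokens in samples:
        # Assumption: truncate any sample that is longer than one pack.
        tokens = tokens[:max_seq_len]
        # Start a new pack when the next sample would overflow the current one.
        if current and len(current) + len(tokens) > max_seq_len:
            packs.append(current)
            current = []
        current = current + tokens
    if current:
        packs.append(current)
    return packs


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packs = pack_greedy(samples, max_seq_len=5)
# packs == [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
```

Packing reduces the number of padding tokens per batch, at the cost of placing unrelated samples in the same sequence.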

PreferenceDataset

Class that supports any custom preference dataset with instruction-based prompts and a configurable template.
