Datasets Overview

torchtune lets you fine-tune LLMs and VLMs on any dataset, whether it is hosted on the Hugging Face Hub, stored locally, or available at a remote URL. We provide built-in dataset builders to help you quickly bootstrap your fine-tuning project for workflows including instruct tuning, preference alignment, continued pretraining, and more. Beyond those, torchtune lets you fully customize your dataset pipeline, so you can train on any data format or schema.

The following tasks are supported:

Data pipeline

[Figure: torchtune datasets pipeline (../_images/torchtune_datasets.svg)]

From raw data samples to the model inputs in the training recipe, all torchtune datasets follow the same pipeline:

  1. Raw data is queried one sample at a time from a Hugging Face dataset, local file, or remote file.

  2. Message Transforms convert the raw sample, which can take any format, into a list of torchtune Messages. Images are contained in the message object they are associated with.

  3. Multimodal Transforms apply model-specific transforms to the messages, including tokenization (see Tokenizers), prompt templating (see Prompt Templates), image transforms, and anything else required for that particular model.

  4. The collater packages the processed samples into a batch, which is passed to the model during training.
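
The four stages above can be sketched in plain Python. This is a minimal illustration only: it does not use torchtune's API, and every name in it (`Message`, `message_transform`, `model_transform`, `collate`, and the `question`/`answer` schema) is a hypothetical stand-in. The tokenizer is a toy whitespace splitter and images are omitted.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# NOTE: all names below are hypothetical stand-ins, not torchtune's API.


@dataclass
class Message:
    role: str     # e.g. "user" or "assistant"
    content: str  # text only in this sketch; images are omitted


def message_transform(sample: Dict[str, Any]) -> List[Message]:
    """Step 2: map a raw sample (assumed 'question'/'answer' schema) to messages."""
    return [
        Message(role="user", content=sample["question"]),
        Message(role="assistant", content=sample["answer"]),
    ]


def model_transform(messages: List[Message], vocab: Dict[str, int]) -> Dict[str, List[int]]:
    """Step 3: apply a toy prompt template, then a toy whitespace tokenizer."""
    tokens: List[int] = []
    for msg in messages:
        templated = f"<{msg.role}> {msg.content}"
        for word in templated.split():
            # Assign a fresh id the first time a word is seen; 0 is reserved for padding.
            tokens.append(vocab.setdefault(word, len(vocab) + 1))
    return {"tokens": tokens}


def collate(samples: List[Dict[str, List[int]]], pad_id: int = 0) -> List[List[int]]:
    """Step 4: right-pad every sample to the longest length in the batch."""
    max_len = max(len(s["tokens"]) for s in samples)
    return [s["tokens"] + [pad_id] * (max_len - len(s["tokens"])) for s in samples]


# Step 1: raw samples, queried one at a time (an in-memory list here).
raw_data = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Name a primary color.", "answer": "Red"},
]

vocab: Dict[str, int] = {}
processed = [model_transform(message_transform(s), vocab) for s in raw_data]
batch = collate(processed)
```

Keeping the message transform separate from the model-specific transform means the same raw-data adapter can be reused across models, while tokenization and templating change per model.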
