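All About Configs

This deep-dive will guide you through writing configs for running recipes.

What this deep-dive will cover

How to write a YAML config and run a recipe with it
How to use the instantiate and parse APIs
How to effectively use configs and CLI overrides for running recipes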

Prerequisites

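Where do parameters live?

There are two primary entry points for configuring parameters: configs and CLI overrides. Configs are YAML files that define all the parameters needed to run a recipe in a single location. They are the single source of truth for reproducing a run. Config parameters can be overridden on the command line using tune, for quick changes and experimentation without modifying the config.

Writing configs

Configs serve as the primary entry point for running recipes in torchtune. They are expected to be YAML files, and they simply list the values of the parameters you want to define for a particular run.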

seed: null
shuffle: True
device: cuda
dtype: fp32
enable_fsdp: True
...

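Configuring components using `instantiate`

Many fields require specifying a torchtune object, along with its keyword arguments, as the parameter. Models, datasets, optimizers, and loss functions are common examples. You can do this easily with the _component_ subfield. In _component_, you specify the dotpath of the object you wish to instantiate in the recipe. The dotpath is the exact path you would use to import the object normally in a Python file. For example, to specify alpaca_dataset in your config with custom arguments: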

dataset:
  _component_: torchtune.datasets.alpaca_dataset
  train_on_input: False

Here, we are changing the default value of train_on_input from True to False.

Once you've specified the _component_ in your config, you can create an instance of it like so:

from torchtune import config

# Access the dataset field and create the object instance
dataset = config.instantiate(cfg.dataset)

This automatically uses any keyword arguments specified in the fields under dataset.
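To make the dotpath idea concrete, here is a minimal sketch of how a _component_ string could be resolved to a Python object using only the standard library. This is not torchtune's implementation: the helper name resolve_dotpath is hypothetical, and fractions.Fraction stands in for a torchtune component so that the snippet runs without torchtune installed.

```python
import importlib
from typing import Any


def resolve_dotpath(dotpath: str) -> Any:
    """Import the module part of a dotpath and return the named attribute."""
    module_name, _, attr_name = dotpath.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected a dotted path like 'pkg.module.name', got {dotpath!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Stand-in config: same shape as the dataset entry above, but pointing at a
# standard-library class (an assumption made so the example is self-contained).
cfg = {"_component_": "fractions.Fraction", "numerator": 3, "denominator": 4}

# Split the component path from its keyword arguments, then build the object.
kwargs = {k: v for k, v in cfg.items() if k != "_component_"}
component = resolve_dotpath(cfg["_component_"])
obj = component(**kwargs)
print(obj)  # 3/4
```

The point is that the dotpath is an ordinary import path: whatever you could write as from module import name can be named in _component_.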

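As written, the preceding example will actually throw an error. If you look at the signature of alpaca_dataset, you'll notice that we're missing a required positional argument, the tokenizer. Since this is another configurable torchtune object, let's see how to handle it by taking a look at the instantiate() API.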

def instantiate(
    config: DictConfig,
    *args: Any,
    **kwargs: Any,
)

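instantiate() also accepts positional arguments and keyword arguments, and automatically combines them with the config when creating the object. This means we can not only pass in the tokenizer, but also add keyword arguments that are not specified in the config if we'd like: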

# Tokenizer is needed for the dataset, configure it first
tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  path: /tmp/tokenizer.model

dataset:
  _component_: torchtune.datasets.alpaca_dataset

# Note the API of the tokenizer we specified - we need to pass in a path
def llama2_tokenizer(path: str) -> Llama2Tokenizer:

# Note the API of the dataset we specified - we need to pass in a model tokenizer
# and any optional keyword arguments
def alpaca_dataset(
    tokenizer: ModelTokenizer,
    train_on_input: bool = True,
    max_seq_len: int = 512,
) -> SFTDataset:

from torchtune import config

# Since we've already specified the path in the config, we don't need to pass
# it in
tokenizer = config.instantiate(cfg.tokenizer)
# We pass in the instantiated tokenizer as the first required argument, then
# we change an optional keyword argument
dataset = config.instantiate(
    cfg.dataset,
    tokenizer,
    train_on_input=False,
)

Note that additional keyword arguments overwrite any duplicate keys in the config.
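The precedence rule can be illustrated with a small standard-library sketch. It assumes a plain dict in place of a DictConfig, and the function name instantiate_sketch is hypothetical; it mirrors the behavior described here rather than reproducing torchtune's code. The built-in int(text, base=10) plays the role of a component with one required positional argument and one optional keyword argument.

```python
import importlib
from typing import Any, Mapping


def instantiate_sketch(config: Mapping[str, Any], *args: Any, **kwargs: Any) -> Any:
    """Build the object named by config['_component_'].

    Keyword arguments from the config act as defaults; keyword arguments
    passed by the caller take precedence over duplicated config keys.
    """
    module_name, _, attr_name = config["_component_"].rpartition(".")
    component = getattr(importlib.import_module(module_name), attr_name)
    config_kwargs = {k: v for k, v in config.items() if k != "_component_"}
    merged = {**config_kwargs, **kwargs}  # caller's kwargs win
    return component(*args, **merged)


# Stand-in component (an assumption for illustration): builtins.int
cfg = {"_component_": "builtins.int", "base": 10}

from_config = instantiate_sketch(cfg, "101")         # uses base=10 from the config
overridden = instantiate_sketch(cfg, "101", base=2)  # caller overrides base
print(from_config, overridden)  # 101 5
```

Merging with the caller's keyword arguments last is what lets the recipe have the final say over a value that also appears in the config.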

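Referencing other config fields with interpolations

Sometimes you need to use the same value in more than one field. You can use interpolations to reference another field, and instantiate() will automatically resolve it for you.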

output_dir: /tmp/alpaca-llama2-finetune
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ${output_dir}
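As a rough illustration of what resolving an interpolation involves, the sketch below substitutes ${key} references in a plain dict using only the standard library. It assumes references point at top-level keys only and is not how torchtune resolves them internally; the function name resolve_interpolations is hypothetical.

```python
import re
from typing import Any, Dict

_INTERPOLATION = re.compile(r"\$\{([^}]+)\}")


def resolve_interpolations(node: Any, root: Dict[str, Any]) -> Any:
    """Recursively replace ${key} in string values with root[key]."""
    if isinstance(node, dict):
        return {k: resolve_interpolations(v, root) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_interpolations(v, root) for v in node]
    if isinstance(node, str):
        return _INTERPOLATION.sub(lambda m: str(root[m.group(1)]), node)
    return node


cfg = {
    "output_dir": "/tmp/alpaca-llama2-finetune",
    "metric_logger": {
        "_component_": "torchtune.training.metric_logging.DiskLogger",
        "log_dir": "${output_dir}",
    },
}

resolved = resolve_interpolations(cfg, cfg)
print(resolved["metric_logger"]["log_dir"])  # /tmp/alpaca-llama2-finetune
```

Because the reference is resolved from the config itself, changing output_dir in one place updates every field that points at it.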

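Validating your config

We provide a convenient CLI utility, tune validate, to quickly verify that your config is well-formed and that all components can be instantiated properly. You can also pass in overrides if you want to test the exact command you will run your experiments with. If any parameters are not well-formed, tune validate lists all the locations where an error was found.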

tune cp llama2/7B_lora_single_device ./my_config.yaml
tune validate ./my_config.yaml
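One kind of check a validator can make without constructing anything is whether the configured keyword arguments fit the component's signature. The sketch below shows that idea with inspect.signature; it is an illustration only, the real utility may check more or work differently, and fake_alpaca_dataset is a hypothetical stand-in that copies the parameters of the alpaca_dataset signature shown earlier.

```python
import inspect
from typing import Any, Callable, Dict, List


def check_kwargs(component: Callable[..., Any], kwargs: Dict[str, Any]) -> List[str]:
    """Return error messages for kwargs the component's signature rejects."""
    try:
        # bind_partial tolerates missing required arguments (for example a
        # tokenizer supplied later by the recipe) but rejects unknown names.
        inspect.signature(component).bind_partial(**kwargs)
    except TypeError as err:
        return [f"{component.__name__}: {err}"]
    return []


# Hypothetical stand-in with the same parameters as alpaca_dataset.
def fake_alpaca_dataset(tokenizer, train_on_input=True, max_seq_len=512):
    return (tokenizer, train_on_input, max_seq_len)


good = check_kwargs(fake_alpaca_dataset, {"train_on_input": False})
bad = check_kwargs(fake_alpaca_dataset, {"train_on_inputs": False})  # typo in key
print(good)      # []
print(len(bad))  # 1
```

Collecting messages instead of raising on the first failure matches the behavior described above, where every location with an error is reported.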

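Best practices for writing configs

Let's discuss some guidelines for writing configs to get the most out of them.

Airtight configs

While it may be tempting to put as much as you can in the config to give yourself maximum flexibility in switching parameters between experiments, we encourage you to include only the fields that will be used or instantiated in the recipe. This makes it fully clear which options a recipe was run with and makes debugging much easier.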

# don't do this
alpaca_dataset:
  _component_: torchtune.datasets.alpaca_dataset
slimorca_dataset:
  ...

# do this
dataset:
  # change this in config or override when needed
  _component_: torchtune.datasets.alpaca_dataset

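Use public APIs only

If a component you wish to specify in a config is located in a private file, use the public dotpath in your config. These components are typically exposed in their parent module's __init__.py file. This way, you can rely on the stability of the API you are using in your config. Your component dotpath should not pass through any private (underscore-prefixed) module.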

# don't do this
dataset:
  _component_: torchtune.datasets._alpaca.alpaca_dataset

# do this
dataset:
  _component_: torchtune.datasets.alpaca_dataset
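If you want to enforce this guideline mechanically, a check along the following lines would flag private dotpaths. The helper name is_public_dotpath is hypothetical and is not part of torchtune; it simply treats any segment that starts with an underscore as private.

```python
def is_public_dotpath(dotpath: str) -> bool:
    """Return True if no segment of the dotpath starts with an underscore."""
    return not any(part.startswith("_") for part in dotpath.split("."))


print(is_public_dotpath("torchtune.datasets._alpaca.alpaca_dataset"))  # False
print(is_public_dotpath("torchtune.datasets.alpaca_dataset"))          # True
```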

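Command-line overrides

Configs are the primary place to collect all the parameters for running a recipe, but sometimes you may want to quickly try different values without having to update the config itself. To enable quick experimentation, you can specify override values for parameters in your config via the tune command. These should be specified as key-value pairs: k1=v1 k2=v2 ...

For example, to run the LoRA single-device finetuning recipe with custom model and tokenizer directories, you can provide overrides: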

tune run lora_finetune_single_device \
--config llama2/7B_lora_single_device \
checkpointer.checkpoint_dir=/home/my_model_checkpoint \
checkpointer.checkpoint_files=['file_1','file_2'] \
tokenizer.path=/home/my_tokenizer_path
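Conceptually, each key=value pair addresses a nested config field by its dotted path. The sketch below applies such pairs to a plain dict. It is an assumption-laden illustration, not the tune CLI's parser: the function name apply_overrides is hypothetical, the starting values in cfg are made up, and values are left as strings rather than converted to Python types.

```python
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply 'dotted.key=value' strings to a nested dict, in place."""
    for item in overrides:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"Override must look like key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})  # create missing levels
        node[leaf] = value  # values stay strings in this sketch
    return config


# Hypothetical starting config for illustration.
cfg = {
    "tokenizer": {"path": "/tmp/tokenizer.model"},
    "checkpointer": {"checkpoint_dir": "/tmp/ckpt"},
}
apply_overrides(
    cfg,
    [
        "checkpointer.checkpoint_dir=/home/my_model_checkpoint",
        "tokenizer.path=/home/my_tokenizer_path",
    ],
)
print(cfg["tokenizer"]["path"])  # /home/my_tokenizer_path
```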

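Overriding components

If you would like to override a class or function in the config that is instantiated via the _component_ field, you can do so by assigning to the parameter name directly. Any nested fields in the component can be overridden with dot notation.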

dataset:
  _component_: torchtune.datasets.alpaca_dataset

# Change to slimorca_dataset and set train_on_input to True
tune run lora_finetune_single_device --config my_config.yaml \
dataset=torchtune.datasets.slimorca_dataset dataset.train_on_input=True

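Removing config fields

When you change a component through overrides and the new component takes different keyword arguments, you may need to remove certain parameters from the config. You can do so by using the ~ flag and specifying the dotpath of the config field you would like to remove. For example, if you want to override a built-in config and use the bitsandbytes.optim.PagedAdamW8bit optimizer, you may need to delete parameters like foreach that are specific to PyTorch optimizers. Note that this example requires that you have bitsandbytes installed.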

# In configs/llama3/8B_full.yaml
optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5
  foreach: False

# Change to PagedAdamW8bit and remove foreach
tune run --nproc_per_node 4 full_finetune_distributed --config llama3/8B_full \
optimizer=bitsandbytes.optim.PagedAdamW8bit ~optimizer.foreach
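The two behaviors described in the last two sections, assigning directly to a component field and removing a field with ~, can be sketched on a plain dict as follows. This is a hedged illustration of the semantics described here, not the CLI's actual code; the function name apply_cli_args is hypothetical and values are kept as strings.

```python
from typing import Any, Dict, List


def apply_cli_args(config: Dict[str, Any], cli_args: List[str]) -> Dict[str, Any]:
    """Apply key=value overrides and ~key removals to a nested dict, in place."""
    for arg in cli_args:
        if arg.startswith("~"):
            # Removal: walk to the parent and drop the leaf if present.
            *parents, leaf = arg[1:].split(".")
            node = config
            for name in parents:
                node = node[name]
            node.pop(leaf, None)
            continue
        key, _, value = arg.partition("=")
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})
        target = node.get(leaf)
        if isinstance(target, dict) and "_component_" in target:
            # Assigning to a component field swaps its _component_ and keeps
            # the remaining keyword arguments.
            target["_component_"] = value
        else:
            node[leaf] = value
    return config


cfg = {"optimizer": {"_component_": "torch.optim.AdamW", "lr": 2e-5, "foreach": False}}
apply_cli_args(cfg, ["optimizer=bitsandbytes.optim.PagedAdamW8bit", "~optimizer.foreach"])
print(cfg)
# {'optimizer': {'_component_': 'bitsandbytes.optim.PagedAdamW8bit', 'lr': 2e-05}}
```

Keeping the other keyword arguments when the component changes is why a removal flag is needed: lr carries over to the new optimizer, while foreach has to be dropped explicitly.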
