Overview¶

Note

The diagram above is for illustration purposes only. Not all boxes are currently available out of the box.

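This module contains a collection of builtin TorchX components. The directory structure is organized by component category. Components are simply templatized app specs. Think of them as factory methods for different types of job definitions. The functions in this module that return specs.AppDef are what we refer to as components.

You can browse the library of components in the torchx.components module or on our docs page.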

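Using Builtins¶

Once you have found a builtin component, you can either:

Run the component as a job
Use the component in the context of a workflow (pipeline)

In both cases the component runs as a job. The difference is whether the job runs as a standalone job directly on a scheduler, or as a "stage" in a workflow with upstream and/or downstream dependencies.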

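Note

Depending on the semantics of the component, the job may be single node or distributed. For instance, if the component has a single role and that role's role.num_replicas == 1, then the job is a single node job. If the component has multiple roles and/or any role's num_replicas > 1, then the job is a multi-node distributed job.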
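The rule in the note above can be sketched as a small predicate. To keep the sketch self-contained, Role and AppDef below are hypothetical stand-ins that model only the fields the rule mentions; they are not the real torchx.specs classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Role:
    """Hypothetical stand-in for specs.Role; only the fields used here."""
    name: str
    num_replicas: int = 1


@dataclass
class AppDef:
    """Hypothetical stand-in for specs.AppDef; only the fields used here."""
    name: str
    roles: List[Role] = field(default_factory=list)


def is_distributed(app: AppDef) -> bool:
    """True if the app has multiple roles or any role with num_replicas > 1."""
    return len(app.roles) > 1 or any(r.num_replicas > 1 for r in app.roles)


single = AppDef("echo", roles=[Role("echo", num_replicas=1)])
multi_replica = AppDef("ddp", roles=[Role("trainer", num_replicas=4)])
multi_role = AppDef("ps", roles=[Role("server"), Role("worker")])

for app in (single, multi_replica, multi_role):
    kind = "distributed" if is_distributed(app) else "single node"
    print(f"{app.name}: {kind}")
```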

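Not sure whether to run the component as a job or as a pipeline stage? Use this rule of thumb:

Just getting started? Familiarize yourself with the component by running it as a job.
Need job dependencies? Run the component as a pipeline stage.
Don't need job dependencies? Run the component as a job.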

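Authoring¶

Since a component is simply a Python function that returns a specs.AppDef, authoring your own component is as simple as writing a Python function that follows these rules:

The component function must return a specs.AppDef, and the return type must be specified
All arguments of the component must be PEP 484 type annotated, and the type must be one of
1. Primitives: int, float, str, bool
2. Optional primitives: Optional[int], Optional[float], Optional[str]
3. Maps of primitives: Dict[Primitive_key, Primitive_value]
4. Lists of primitives: List[Primitive_values]
5. Optional collections: Optional[List], Optional[Dict]
6. VAR_ARG: *arg (useful when passing arguments through to the entry point script)
(optional) A docstring in Google format (in particular, see function_with_pep484_type_annotations). This docstring is purely informative: the torchx CLI uses it to autogenerate an informative --help message, which is useful when sharing components with others. If the component has no docstring, the --help option still works, but the parameters get a canned description (see below). Note that when running components programmatically via torchx.runner, torchx does not pick up the docstring at all.

Below is an example component that launches DDP scripts. It is a simplified version of the torchx.components.dist.ddp() builtin.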

import os
import torchx.specs as specs


def ddp(
    *script_args: str,
    image: str,
    script: str,
    host: str = "aws_p3.2xlarge",
    nnodes: int = 1,
    nproc_per_node: int = 1,
) -> specs.AppDef:
    return specs.AppDef(
        name=os.path.basename(script),
        roles=[
            specs.Role(
                name="trainer",
                image=image,
                # look up the resource spec registered under the host name
                resource=specs.named_resources[host],
                num_replicas=nnodes,
                entrypoint="python",
                args=[
                    "-m",
                    "torch.distributed.run",
                    "--rdzv_backend=etcd",
                    "--rdzv_endpoint=localhost:5900",
                    f"--nnodes={nnodes}",
                    f"--nproc_per_node={nproc_per_node}",
                    # torch.distributed.run's own -m flag: treat `script` as a module
                    "-m",
                    script,
                    *script_args,
                ],
            ),
        ],
    )

Assuming the component above is saved in example.py, we can run --help on it as follows:

$ torchx run ./example.py:ddp --help
usage: torchx run ...torchx_params... ddp  [-h] --image IMAGE --script SCRIPT [--host HOST]
                                          [--nnodes NNODES] [--nproc_per_node NPROC_PER_NODE]
                                          ...

AppDef: ddp. TIP: improve this help string by adding a docstring ...<omitted for brevity>...

positional arguments:
  script_args           (required)

optional arguments:
  -h, --help            show this help message and exit
  --image IMAGE         (required)
  --script SCRIPT       (required)
  --host HOST           (default: aws_p3.2xlarge)
  --nnodes NNODES       (default: 1)
  --nproc_per_node NPROC_PER_NODE
                        (default: 1)

If we include a docstring like this:

def ddp(...) -> specs.AppDef:
  """
  DDP Simplified.

  Args:
     image: name of the docker image containing the script + deps
     script: path of the script in the image
     script_args: arguments to the script
     host: machine type (one from named resources)
     nnodes: number of nodes to launch
     nproc_per_node: number of scripts to launch per node

  """

  # ... component body same as above ...
  pass

Then the --help message reflects the function and parameter descriptions from the docstring:

usage: torchx run ...torchx_params... ddp  [-h] --image IMAGE --script SCRIPT [--host HOST]
                                          [--nnodes NNODES] [--nproc_per_node NPROC_PER_NODE]
                                          ...

App spec: DDP simplified.

positional arguments:
  script_args           arguments to the script

optional arguments:
  -h, --help            show this help message and exit
  --image IMAGE         name of the docker image containing the script + deps
  --script SCRIPT       path of the script in the image
  --host HOST           machine type (one from named resources)
  --nnodes NNODES       number of nodes to launch
  --nproc_per_node NPROC_PER_NODE
                        number of scripts to launch per node
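To illustrate how a Google-style docstring can supply per-parameter help text, here is a minimal sketch that extracts the Args: entries with the standard library only. It is a simplified illustration, not the parser TorchX actually uses: parse_arg_docs is a hypothetical helper, it handles one line per parameter, and the ddp function below is a stand-in that returns None so the sketch needs no torchx import.

```python
import inspect
from typing import Callable, Dict


def parse_arg_docs(fn: Callable) -> Dict[str, str]:
    """Map parameter name -> description from a Google-style ``Args:`` block.

    Simplified: one line per parameter, no continuation lines.
    """
    doc = inspect.getdoc(fn) or ""  # getdoc() also strips common indentation
    descriptions: Dict[str, str] = {}
    in_args = False
    for line in doc.splitlines():
        stripped = line.strip()
        if stripped == "Args:":
            in_args = True
            continue
        if in_args:
            # A blank line or an unindented line ends the Args block.
            if not stripped or not line.startswith(" "):
                break
            name, sep, desc = stripped.partition(":")
            if sep:
                descriptions[name.strip()] = desc.strip()
    return descriptions


def ddp(*script_args: str, image: str, script: str, nnodes: int = 1) -> None:
    """
    DDP Simplified.

    Args:
        image: name of the docker image containing the script + deps
        script: path of the script in the image
        script_args: arguments to the script
        nnodes: number of nodes to launch
    """


docs = parse_arg_docs(ddp)
for param in inspect.signature(ddp).parameters:
    # Fall back to a canned description when the docstring has no entry.
    print(f"{param}: {docs.get(param, '(no description)')}")
```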

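Validating¶

To validate that you have defined your component correctly, you can either:

(easiest) Dryrun your component's --help with the CLI: torchx run --dryrun ~/component.py:train --help
Use the component linter (see dist_test.py for an example)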
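For a rough idea of what a linter-style check involves, the sketch below verifies two of the authoring rules (a declared return type and annotated arguments) using only inspect. It is not the TorchX linter: check_component_signature is a hypothetical helper, and it only checks that annotations are present, not that they are among the supported types.

```python
import inspect
from typing import Callable, List


def check_component_signature(fn: Callable) -> List[str]:
    """Return a list of rule violations; an empty list means the checks passed."""
    errors: List[str] = []
    sig = inspect.signature(fn)
    if sig.return_annotation is inspect.Signature.empty:
        errors.append("missing return type annotation")
    for name, param in sig.parameters.items():
        if param.annotation is inspect.Parameter.empty:
            errors.append(f"parameter '{name}' is not type annotated")
    return errors


# The return annotation is a string so the sketch runs without torchx installed.
def good(*script_args: str, image: str, nnodes: int = 1) -> "specs.AppDef":
    ...


def bad(image, nnodes: int = 1):
    ...


print(check_component_signature(good))
print(check_component_signature(bad))
```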

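Running as a Job¶

You can run a component as a job with the torchx CLI or programmatically with torchx.runner. The two are equivalent (the CLI uses the runner under the hood), so the choice is yours. The quickstart guide walks you through the basics to get started.

Programmatic Run¶

To run a builtin or your own component programmatically, invoke the component as a regular Python function and pass the result to torchx.runner. Below is an example of calling the utils.echo builtin: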

from torchx.components.utils import echo
from torchx.runner import get_runner

get_runner().run(echo(msg="hello world"), scheduler="local_cwd")

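CLI Run (Builtins)¶

When running components from the CLI, you have to pass the component function to invoke. For builtin components this takes the form {component_module}.{component_fn}, where {component_module} is the module path of the component relative to torchx.components and {component_fn} is the component function within that module. So for torchx.components.utils.echo, we drop the torchx.components prefix and run it as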

$ torchx run utils.echo --msg "hello world"

See the CLI docs for more information.
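The naming rule for builtins amounts to dropping the torchx.components prefix from the module path. A minimal sketch of that mapping, where cli_name is a hypothetical helper and not part of TorchX:

```python
BUILTIN_PREFIX = "torchx.components."


def cli_name(module: str, fn_name: str) -> str:
    """Build the {component_module}.{component_fn} name passed to `torchx run`."""
    if not module.startswith(BUILTIN_PREFIX):
        raise ValueError(f"{module} is not under torchx.components")
    return f"{module[len(BUILTIN_PREFIX):]}.{fn_name}"


print(cli_name("torchx.components.utils", "echo"))  # utils.echo
print(cli_name("torchx.components.dist", "ddp"))    # dist.ddp
```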

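CLI Run (Custom)¶

To run a custom component with the CLI, use the slightly different form {component_path}:{component_fn}, where {component_path} is the file path of your component's Python file and {component_fn} is the name of the component function in that file. Assuming your component is in /home/bob/component.py and the component function is called train(), you would run it as follows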

# option 1. use absolute path
$ torchx run /home/bob/component.py:train --help

# option 2. let the shell do the expansion
$ torchx run ~/component.py:train --help

# option 3. same but after CWD to $HOME
$ cd ~/
$ torchx run ./component.py:train --help

# option 4. files can be relative to CWD
$ cd ~/
$ torchx run component.py:train --help

Note

Builtins can be run this way as well, provided you know the install directory of TorchX!
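To make the {component_path}:{component_fn} form concrete, the sketch below splits such a reference and imports the function from the file with importlib. This is an illustration under assumptions, not how the torchx CLI is implemented: split_component_ref and load_component are hypothetical helpers, and the demo writes a throwaway component.py to a temporary directory.

```python
import importlib.util
import os
import tempfile
from typing import Callable, Tuple


def split_component_ref(ref: str) -> Tuple[str, str]:
    """Split '{component_path}:{component_fn}' at the last colon."""
    path, sep, fn_name = ref.rpartition(":")
    if not sep or not path or not fn_name:
        raise ValueError(f"expected '<path>:<function>', got {ref!r}")
    # Expand '~' and resolve paths relative to the current working directory.
    return os.path.abspath(os.path.expanduser(path)), fn_name


def load_component(ref: str) -> Callable:
    """Import the file and return the named function from it."""
    path, fn_name = split_component_ref(ref)
    spec = importlib.util.spec_from_file_location("custom_component", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, fn_name)


# Demo with a throwaway file standing in for /home/bob/component.py.
with tempfile.TemporaryDirectory() as tmp:
    component_file = os.path.join(tmp, "component.py")
    with open(component_file, "w") as f:
        f.write("def train(epochs: int) -> str:\n    return f'train x{epochs}'\n")
    train = load_component(f"{component_file}:train")
    result = train(3)

print(result)
```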

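Passing Component Params from CLI¶

Since components are simply Python functions, using them programmatically is straightforward. As seen above, when running components via the CLI's run subcommand, the component parameters are passed as program arguments using the double-dash + param_name syntax (e.g. --param1=1 or --param1 1). The CLI autogenerates an argparse parser from the component's function signature, taking the help text from its docstring. Below is a summary of how to pass component parameters of various types, assuming the component is defined as follows: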

# in comp.py
from typing import Dict, List
import torchx.specs as specs

def f(i: int, f: float, s: str, b: bool, l: List[str], d: Dict[str, str], *args: str) -> specs.AppDef:
   """
   Example component

   Args:
       i: int param
       f: float param
       s: string param
       b: bool param
       l: list param
       d: map param
       args: varargs param

   Returns: specs.AppDef
   """

   pass

Help: torchx run comp.py:f --help
Primitives (int, float, str): torchx run comp.py:f --i 1 --f 1.2 --s "bar"
Bool: torchx run comp.py:f --b True (or --b False)
Map: torchx run comp.py:f --d k1=v1,k2=v2,k3=v3
List: torchx run comp.py:f --l a,b,c
VAR_ARG: *args are passed as positionals rather than as named arguments, so they are specified at the end of the command. The -- delimiter can be used to start the VAR_ARGS section. This is useful when the component and the application take arguments with the same name, or when passing through the --help argument. A few examples:
  *args=["arg1", "arg2", "arg3"]: torchx run comp.py:f --i 1 arg1 arg2 arg3
  *args=["--flag", "arg1"]: torchx run comp.py:f --i 1 --flag arg1
  *args=["--help"]: torchx run comp.py:f -- --help
  *args=["--i", "2"]: torchx run comp.py:f --i 1 -- --i 2
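The list, map, and VAR_ARG conventions above can be imitated with a hand-written argparse parser. This is only a sketch of the syntax, not the parser the torchx CLI generates: parse_list, parse_map, and parse_bool are hypothetical helpers, and accepting "true"/"false" case-insensitively is an assumption.

```python
import argparse
from typing import Dict, List


def parse_list(value: str) -> List[str]:
    """'a,b,c' -> ['a', 'b', 'c']"""
    return value.split(",") if value else []


def parse_map(value: str) -> Dict[str, str]:
    """'k1=v1,k2=v2' -> {'k1': 'v1', 'k2': 'v2'}"""
    result: Dict[str, str] = {}
    for pair in parse_list(value):
        key, sep, val = pair.partition("=")
        if not sep:
            raise argparse.ArgumentTypeError(f"expected key=value, got {pair!r}")
        result[key] = val
    return result


def parse_bool(value: str) -> bool:
    """'True' / 'False' (case-insensitive, an assumption) -> bool"""
    lowered = value.lower()
    if lowered not in ("true", "false"):
        raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")
    return lowered == "true"


parser = argparse.ArgumentParser(prog="comp.py:f")
parser.add_argument("--i", type=int, required=True)
parser.add_argument("--b", type=parse_bool, default=False)
parser.add_argument("--l", type=parse_list, default=[])
parser.add_argument("--d", type=parse_map, default={})
parser.add_argument("args", nargs="*")  # plays the role of *args

# Everything after "--" is treated as positional, even if it looks like a flag.
ns = parser.parse_args(
    ["--i", "1", "--b", "True", "--l", "a,b,c", "--d", "k1=v1,k2=v2", "--", "--i", "2"]
)
print(ns.i, ns.b, ns.l, ns.d, ns.args)

# "--help" after the delimiter is passed through instead of printing help.
passthrough = parser.parse_args(["--i", "1", "--", "--help"])
print(passthrough.args)
```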

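Run in a Pipeline¶

torchx.pipelines defines adapters that convert a TorchX component into a pipeline "stage" on the target pipeline platform (see Pipeline Adapters for the list of supported pipeline orchestrators).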

Additional Resources¶

See:

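Components defined in this module, as expository examples
The quickstart guide on defining your own component
Component best practices guide
App best practices guide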
