Base

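These components are meant to be used as convenience methods when constructing other components. Many methods in the base component library are factory methods for Role, Container, and Resources that are hooked up to TorchX's configurable extension points.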

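torchx.components.base.named_resource(resource: str) → torchx.specs.api.Resource

Gets a resource object based on the string definition registered via entrypoints.txt.

TorchX implements a named resource registration mechanism, which consists of the following steps:

Create a module and define your resource retrieval function: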

# my_module.resources
from torchx.specs import Resource

def gpu_x_1() -> Resource:
    # The function returns a single Resource, so it is annotated as such.
    return Resource(cpu=2, memMB=64 * 1024, gpu=2)

Register the resource retrieval function in the entrypoints section:

[torchx.schedulers]
gpu_x_1 = my_module.resources:gpu_x_1

The name gpu_x_1 can then be used as the string argument to this function:

resource = named_resource("gpu_x_1")
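
The lookup described above can be pictured as a registry that maps a name to a factory function. The following is a minimal, self-contained sketch and not TorchX's implementation: `ResourceSketch`, `_REGISTRY`, and `named_resource_sketch` are hypothetical names, and an in-memory dict is assumed in place of the entry-point registration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ResourceSketch:
    """Hypothetical stand-in for torchx.specs.Resource (illustration only)."""
    cpu: int
    gpu: int
    memMB: int


def gpu_x_1() -> ResourceSketch:
    # Same values as the resource retrieval function shown earlier.
    return ResourceSketch(cpu=2, gpu=2, memMB=64 * 1024)


# Assumption: a plain dict plays the role of the entry-point group that
# maps a registered name to its factory function.
_REGISTRY: Dict[str, Callable[[], ResourceSketch]] = {"gpu_x_1": gpu_x_1}


def named_resource_sketch(name: str) -> ResourceSketch:
    """Look up the factory registered under `name` and call it."""
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise KeyError(f"no resource registered under {name!r}") from None
    return factory()


resource = named_resource_sketch("gpu_x_1")
```

Registering a factory rather than a ready-made object means the resource is only constructed when it is requested by name.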

torchx.components.base.binary_component.binary_component(name: str, image: str, entrypoint: str, args: Optional[List[str]] = None, env: Optional[Dict[str, str]] = None, resource: torchx.specs.api.Resource = Resource(cpu=-1, gpu=-1, memMB=-1, capabilities={})) → torchx.specs.api.AppDef

binary_component creates an AppDef from the provided arguments.

>>> from torchx.components.base.binary_component import binary_component
>>> binary_component(
...     name="datapreproc",
...     image="pytorch/pytorch:latest",
...     entrypoint="python3",
...     args=["--version"],
...     env={
...         "FOO": "bar",
...     },
... )
AppDef(name='datapreproc', ...)

torchx.components.base.roles.create_torch_dist_role(name: str, image: str, entrypoint: str, resource: torchx.specs.api.Resource = Resource(cpu=-1, gpu=-1, memMB=-1, capabilities={}), base_image: Optional[str] = None, script_args: Optional[List[str]] = None, script_envs: Optional[Dict[str, str]] = None, num_replicas: int = 1, max_retries: int = 0, port_map: Dict[str, int] = field(default_factory=dict), retry_policy: torchx.specs.api.RetryPolicy = <RetryPolicy.APPLICATION: 'APPLICATION'>, **launch_kwargs: Any) → torchx.specs.api.Role

A Role for which the user-provided entrypoint is executed by the torchelastic agent (in the container). Note that the torchelastic agent invokes multiple copies of entrypoint.

For more information about torchelastic, see the torchelastic quickstart docs.

Important

It is the user's responsibility to ensure that the role's image includes torchelastic. Since TorchX has no control over the build process of the image, it cannot automatically include torchelastic in the role's image.

The following example launches 4 replicas (nodes) of an elastic my_train_script.py that is allowed to scale between 2 and 4 nodes. Each node runs 8 workers, which are allowed to fail and restart a maximum of 3 times.

Warning

num_replicas must be an integer within the (inclusive) range given by nnodes. That is, Role("trainer", nnodes="2:4", num_replicas=5) is invalid and will result in undefined behavior.

>>> from torchx.components.base.roles import create_torch_dist_role
>>> from torchx.specs.api import NULL_RESOURCE
>>> elastic_trainer = create_torch_dist_role(
...     name="trainer",
...     image="<NONE>",
...     resource=NULL_RESOURCE,
...     entrypoint="my_train_script.py",
...     script_args=["--script_arg", "foo", "--another_arg", "bar"],
...     num_replicas=4, max_retries=1,
...     nproc_per_node=8, nnodes="2:4", max_restarts=3)
... # effectively runs:
... #    python -m torch.distributed.launch
... #        --nproc_per_node 8
... #        --nnodes 2:4
... #        --max_restarts 3
... #        my_train_script.py --script_arg foo --another_arg bar
>>> elastic_trainer
Role(name='trainer', ...)
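
Because an out-of-range replica count leads to undefined behavior rather than an error, it can be worth checking the constraint from the warning above before building the role. The helpers below are a hedged sketch: `parse_nnodes` and `check_replicas` are hypothetical names that are not part of TorchX, and accepting a single "N" in addition to "MIN:MAX" is an assumption.

```python
from typing import Tuple


def parse_nnodes(nnodes: str) -> Tuple[int, int]:
    """Parse "MIN:MAX" (or, by assumption, a single "N") into (min, max)."""
    parts = nnodes.split(":")
    if len(parts) == 1:
        low = high = int(parts[0])
    elif len(parts) == 2:
        low, high = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"expected 'N' or 'MIN:MAX', got {nnodes!r}")
    if low < 1 or low > high:
        raise ValueError(f"invalid node range: {nnodes!r}")
    return low, high


def check_replicas(num_replicas: int, nnodes: str) -> None:
    """Raise ValueError if num_replicas is outside the inclusive nnodes range."""
    low, high = parse_nnodes(nnodes)
    if not low <= num_replicas <= high:
        raise ValueError(
            f"num_replicas={num_replicas} is outside nnodes range {low}:{high}"
        )


check_replicas(4, "2:4")  # within range, returns without raising
```

Calling `check_replicas(5, "2:4")` raises a ValueError, which turns the invalid combination from the warning into an explicit failure.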

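Parameters

name – Name of the role
entrypoint – User binary or Python script that will be launched.
resource – Resource that is requested by the scheduler
base_image – Optional base image, if the scheduler supports image overlay
script_args – User-provided arguments
script_envs – Environment variables that will be set on the worker processes that run the entrypoint
num_replicas – Number of role replicas to run
max_retries – Maximum number of retries
port_map – Port mapping for the role
retry_policy – Retry policy that is applied to the role
launch_kwargs – kwarg-style launch arguments that will be used to launch the torchelastic agent.

Returns

Role object that launches the user entrypoint through the torchelastic agent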
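
The "effectively runs" comment in the example suggests that each launch keyword argument becomes a `--key value` flag placed before the entrypoint, followed by the script arguments. The sketch below assumes exactly that one-to-one translation. `build_launch_args` is a hypothetical helper, and the way TorchX actually assembles the command may differ.

```python
from typing import Any, List, Optional


def build_launch_args(
    entrypoint: str,
    script_args: Optional[List[str]] = None,
    **launch_kwargs: Any,
) -> List[str]:
    """Turn launch kwargs into flags, then append entrypoint and its args."""
    args: List[str] = []
    # Keyword arguments keep their call order, so flags appear as written.
    for key, value in launch_kwargs.items():
        args += [f"--{key}", str(value)]
    args.append(entrypoint)
    args += script_args or []
    return args


cmd = build_launch_args(
    "my_train_script.py",
    script_args=["--script_arg", "foo", "--another_arg", "bar"],
    nproc_per_node=8,
    nnodes="2:4",
    max_restarts=3,
)
print(" ".join(cmd))
```

With the values from the example, the flags come out in the same order as in the "effectively runs" comment.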
