Configuring

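TorchX defines plugin points so that you can configure it to best support your infrastructure setup. Most of the configuration is done through Python's entry points <https://packaging.python.org/specifications/entry-points/>.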

The entry points described below can be specified in your project's setup.py file as

from setuptools import setup

setup(
   name="project foobar",
   entry_points={
       "$ep_group_name" : [
           "$ep_name = $module:$function"
       ],
   }
)
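To make the `$ep_name = $module:$function` notation concrete, here is a minimal sketch of how such a line resolves to a Python object, using only the standard library's importlib.metadata. This is not TorchX's own loader. The helper `load_group` and the group name `demo.group` are hypothetical, and `json:dumps` stands in for a real plugin function.

```python
import json
from importlib.metadata import EntryPoint, entry_points


def load_group(group: str) -> dict:
    """Load every installed entry point registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry point collection
        selected = eps.select(group=group)
    else:
        # Python 3.8 / 3.9: plain dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}


# "$ep_name = $module:$function", with json.dumps standing in for a plugin.
demo = EntryPoint(name="demo_plugin", value="json:dumps", group="demo.group")
plugin = demo.load()  # imports `json` and returns its `dumps` attribute
```

Once a package declaring entry points is installed, a call such as `load_group("torchx.schedulers")` would return the registered names mapped to the loaded functions.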

Registering Custom Schedulers

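You can implement a custom scheduler by implementing the torchx.schedulers.Scheduler interface. Once you have done so, register the custom scheduler with the following entry point: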

[torchx.schedulers]
$sched_name = my.custom.scheduler:create_scheduler

The create_scheduler function should have the following signature:

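from typing import Any

from torchx.schedulers import Scheduler

# MyScheduler is your own Scheduler subclass, defined or imported in this module.
def create_scheduler(session_name: str, **kwargs: Any) -> Scheduler:
    return MyScheduler(session_name, **kwargs)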
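The sketch below illustrates how a factory with this signature might be looked up by name and called. It runs without TorchX installed, so `MyScheduler` here is a hypothetical stand-in rather than a real `torchx.schedulers.Scheduler` implementation. The `factories` registry, the name `my_sched`, and the `queue` option are assumptions made for illustration and do not describe how TorchX invokes the factory internally.

```python
from typing import Any, Callable, Dict


class MyScheduler:
    """Hypothetical stand-in; a real one would implement torchx.schedulers.Scheduler."""

    def __init__(self, session_name: str, **kwargs: Any) -> None:
        self.session_name = session_name
        self.options = kwargs  # scheduler-specific settings passed through


def create_scheduler(session_name: str, **kwargs: Any) -> MyScheduler:
    return MyScheduler(session_name, **kwargs)


# Hypothetical registry keyed by the name on the left of the entry point line.
factories: Dict[str, Callable[..., MyScheduler]] = {"my_sched": create_scheduler}

scheduler = factories["my_sched"]("demo_session", queue="default")
```

Accepting `**kwargs` keeps the factory signature stable even if your scheduler later needs additional constructor options.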

Registering Named Resources

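A named resource is a set of predefined resource specs that are given a string name. This is particularly useful when your cluster has a fixed set of instance types. For example, if your deep learning training Kubernetes cluster on AWS consists only of p3.16xlarge instances (64 vCPUs, 8 GPUs, 488GB), you may want to enumerate t-shirt-sized resource specs for the containers as follows: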

{
   "gpu_x_1" : Resource(cpu=8,  gpu=1, memMB=61*GB),
   "gpu_x_2" : Resource(cpu=16, gpu=2, memMB=122*GB),
   "gpu_x_4" : Resource(cpu=32, gpu=4, memMB=244*GB),
   "gpu_x_8" : Resource(cpu=64, gpu=8, memMB=488*GB),
}

and refer to the resources by their string names.

TorchX supports custom resource definitions and referencing them by string name:

[torchx.named_resources]
gpu_x_2 = my_module.resources:gpu_x_2
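A sketch of what `my_module/resources.py` might contain is shown below. It assumes that each entry point target is a zero-argument function returning a resource spec, and that `GB` is the number of megabytes per gigabyte (1024); neither is stated in the text above. To stay runnable without TorchX, `Resource` is a stand-in dataclass with the three fields used in the table; in a real project you would import `Resource` from `torchx.specs` instead.

```python
from dataclasses import dataclass

GB = 1024  # assumed: megabytes per gigabyte, so memMB stays in MB


@dataclass
class Resource:
    """Stand-in with the three fields used above; use torchx.specs.Resource in practice."""

    cpu: int
    gpu: int
    memMB: int


def gpu_x_1() -> Resource:
    # One eighth of a p3.16xlarge
    return Resource(cpu=8, gpu=1, memMB=61 * GB)


def gpu_x_2() -> Resource:
    # One quarter of a p3.16xlarge
    return Resource(cpu=16, gpu=2, memMB=122 * GB)
```

Defining each size as its own function keeps the entry point line (`gpu_x_2 = my_module.resources:gpu_x_2`) and the Python name in one-to-one correspondence.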

The named resource can then be used in the following manner:

# my_module.component
from torchx.specs import AppDef, Role, named_resources
from torchx.components.base import torch_dist_role

app = AppDef(name="test_app", roles=[Role(.., resource=named_resources("gpu_x_2"))])

or, when using torch.distributed.run:

# my_module.component
from torchx.specs import AppDef, Role
from torchx.components.base import torch_dist_role

app = AppDef(name="test_app", roles=[torch_dist_role(.., resource="gpu_x_2", ...)])

Registering Custom Components

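You can author a set of custom components and register them with the torchx CLI as builtins. This lets you curate the components most relevant to your team or organization and ship them as CLI builtins, so that users see your custom components when they run the torchx builtins command: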

$ torchx builtins

Custom components can be registered via the following entry point:

[torchx.components]
custom_component = my_module.components:my_component

Custom components can be executed in the following manner:

$ torchx run --scheduler local --scheduler_args image_fetcher=...,root_dir=/tmp custom_component -- --name "test app"
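Pulling the three groups together, the sketch below collects the entry point lines from the sections above into one dictionary that could be passed to `setup(entry_points=...)` in setup.py. The scheduler name `my_sched` is a hypothetical replacement for `$sched_name`, and `parse_spec` is an illustrative helper (not part of TorchX or setuptools) that catches malformed lines early.

```python
import re

# Groups and targets are taken from the sections above;
# "my_sched" is a hypothetical scheduler name replacing $sched_name.
ENTRY_POINTS = {
    "torchx.schedulers": ["my_sched = my.custom.scheduler:create_scheduler"],
    "torchx.named_resources": ["gpu_x_2 = my_module.resources:gpu_x_2"],
    "torchx.components": ["custom_component = my_module.components:my_component"],
}

_SPEC = re.compile(r"^\s*([\w.-]+)\s*=\s*([\w.]+):([\w.]+)\s*$")


def parse_spec(spec: str) -> tuple:
    """Split "name = module:function" into (name, module, function)."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"malformed entry point: {spec!r}")
    return match.groups()


# Fail early on typos before handing the dict to setup(entry_points=ENTRY_POINTS).
parsed = {
    group: [parse_spec(spec) for spec in specs]
    for group, specs in ENTRY_POINTS.items()
}
```

In setup.py you would then write `setup(name="project foobar", entry_points=ENTRY_POINTS)`.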
