Advanced Usage¶

TorchX defines plugin points that let you configure it to best support your infrastructure setup. Most of this configuration is done through Python's entry points.

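Note

Entry points only take effect once the Python package that declares them is installed. If you don't have a Python package, we recommend creating one so that you can share resource definitions, schedulers, and components across your team and organization.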

The entry points described below can be specified in your project's setup.py file as:

from setuptools import setup

setup(
    name="project foobar",
    entry_points={
        "torchx.schedulers": [
            "my_scheduler = my.custom.scheduler:create_scheduler",
        ],
        "torchx.named_resources": [
            "gpu_x2 = my_module.resources:gpu_x2",
        ],
    }
)
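To check which entries an installed package actually exposes, the standard library's importlib.metadata can list an entry point group. The following is a minimal sketch, not part of TorchX: the helper name list_entry_points is hypothetical, and it assumes Python 3.8 or later. If nothing is registered for a group, it returns an empty dict.

```python
from importlib import metadata
from typing import Dict


def list_entry_points(group: str) -> Dict[str, str]:
    """Return {entry point name: "module:attr" value} for one group."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes select() for filtering by group.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


if __name__ == "__main__":
    for group in ("torchx.schedulers", "torchx.named_resources"):
        print(group, list_entry_points(group))
```

Running this after installing the package is a quick way to confirm that the group names and the name = module:function pairs were picked up as intended.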

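Registering Custom Schedulers¶

You can implement a custom scheduler by implementing the torchx.schedulers.Scheduler interface.

The create_scheduler function should have the following signature: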

from torchx.schedulers import Scheduler

def create_scheduler(session_name: str, **kwargs: object) -> Scheduler:
    return MyScheduler(session_name, **kwargs)

You can then register this custom scheduler by adding an entry_points definition to your Python project:

# setup.py
...
entry_points={
    "torchx.schedulers": [
        "my_scheduler = my.custom.scheduler:create_scheduler",
    ],
}
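The right-hand side of each entry is an object reference of the form module:attr. The sketch below shows how such a reference is generally resolved, using only the standard library. It is an illustration of the convention, not TorchX's own loading code; resolve_object_reference is a hypothetical helper, and json:dumps stands in for my.custom.scheduler:create_scheduler so the example runs without a custom package.

```python
import importlib
from typing import Any


def resolve_object_reference(value: str) -> Any:
    """Import the object named by a "module:attr" entry point value."""
    module_name, _, attr_path = value.partition(":")
    obj = importlib.import_module(module_name)
    # The attribute part may itself be dotted, e.g. "Class.method".
    for attr in filter(None, attr_path.split(".")):
        obj = getattr(obj, attr)
    return obj


if __name__ == "__main__":
    # Standard-library stand-in for a scheduler factory reference.
    dumps = resolve_object_reference("json:dumps")
    print(dumps({"ok": True}))
```

Because the attribute is looked up only when the reference is resolved, a misspelled function name in setup.py surfaces as an AttributeError at load time rather than at install time.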

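Registering Named Resources¶

A named resource is a predefined resource spec that is given a string name. This is particularly useful when your cluster has a fixed set of instance types. For instance, if your deep learning training Kubernetes cluster on AWS consists only of p3.16xlarge instances (64 vCPUs, 8 GPUs, 488GB), you may want to enumerate T-shirt-sized resource specs for the containers as follows: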

from torchx.specs import Resource

def gpu_x1() -> Resource:
    return Resource(cpu=8,  gpu=1, memMB=61_000)

def gpu_x2() -> Resource:
    return Resource(cpu=16, gpu=2, memMB=122_000)

def gpu_x3() -> Resource:
    return Resource(cpu=32, gpu=4, memMB=244_000)

def gpu_x4() -> Resource:
    return Resource(cpu=64, gpu=8, memMB=488_000)

To make these resource definitions available, register them via entry_points:

# setup.py
...
entry_points={
    "torchx.named_resources": [
        "gpu_x2 = my_module.resources:gpu_x2",
    ],
}

Once you install the package with the entry_points definitions, the named resource can be used in the following manner:

>>> from torchx.specs import get_named_resources
>>> get_named_resources("gpu_x2")
Resource(cpu=16, gpu=2, memMB=122000, ...)

# my_module.component
from torchx.specs import AppDef, Role, get_named_resources

def test_app(resource: str) -> AppDef:
    return AppDef(name="test_app", roles=[
        Role(
            name="...",
            image="...",
            resource=get_named_resources(resource),
        )
    ])

test_app("gpu_x2")
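The sizes gpu_x1 through gpu_x4 defined earlier in this section are proportional slices of a single p3.16xlarge host: each container gets the same fraction of CPU and memory as it gets of the GPUs. The arithmetic can be sketched as follows. ResourceSpec is a hypothetical stand-in for torchx.specs.Resource so the example runs without TorchX installed, and slice_of_host is an illustrative helper, not a TorchX API.

```python
from typing import NamedTuple


class ResourceSpec(NamedTuple):
    """Hypothetical stand-in for torchx.specs.Resource (same three fields)."""
    cpu: int
    gpu: int
    memMB: int


# Host capacity quoted above for p3.16xlarge: 64 vCPU, 8 GPU, 488GB.
HOST = ResourceSpec(cpu=64, gpu=8, memMB=488_000)


def slice_of_host(gpus: int, host: ResourceSpec = HOST) -> ResourceSpec:
    """Give a container the same fraction of CPU and memory as of GPUs."""
    if gpus <= 0 or host.gpu % gpus != 0:
        raise ValueError(f"gpus must evenly divide {host.gpu}, got {gpus}")
    parts = host.gpu // gpus
    return ResourceSpec(cpu=host.cpu // parts, gpu=gpus, memMB=host.memMB // parts)


if __name__ == "__main__":
    for gpus in (1, 2, 4, 8):
        print(slice_of_host(gpus))
```

For 1, 2, 4, and 8 GPUs this reproduces the cpu and memMB values used in the four resource functions above.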

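Registering Custom Components¶

You can author a set of custom components and register them with the torchx CLI as builtins. This makes it possible to curate the set of components most relevant to your team or organization and support it as a CLI builtin. Users will then see your custom components when they run: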

$ torchx builtins

Custom components are registered via [torchx.components] entry points. Suppose my_project.bar has the following directory structure:

$PROJECT_ROOT/my_project/bar/
    |- baz.py

and baz.py has a single component (function) called trainer:

# baz.py
import torchx.specs as specs

def trainer(...) -> specs.AppDef: ...

and the entry point is added as:

# setup.py
...
entry_points={
    "torchx.components": [
        "foo = my_project.bar",
    ],
}

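TorchX searches the module my_project.bar for all defined components and groups the components it finds under the foo.* prefix. In this case, the component my_project.bar.baz.trainer is registered with the name foo.baz.trainer.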

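Note

Only Python packages (directories with an __init__.py file) are searched; TorchX makes no attempt to recurse into namespace packages (directories without an __init__.py file). However, you may register a top-level namespace package.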
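To see which directories count as regular packages under this rule, the sketch below walks a directory tree and keeps only directories reachable through an unbroken chain of __init__.py files. It approximates the rule stated in the note and is not TorchX's actual search code; regular_packages is a hypothetical helper, and the demo builds a throwaway tree in a temporary directory.

```python
import tempfile
from pathlib import Path
from typing import List


def regular_packages(root: Path) -> List[str]:
    """List dotted names of packages under root that have an __init__.py."""
    found: List[str] = []

    def visit(directory: Path, prefix: str) -> None:
        for child in sorted(directory.iterdir()):
            if child.is_dir() and (child / "__init__.py").is_file():
                name = f"{prefix}{child.name}"
                found.append(name)
                visit(child, name + ".")
            # Directories without __init__.py are skipped, together with
            # everything beneath them.

    visit(root, "")
    return found


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "my_project" / "bar").mkdir(parents=True)
        (root / "my_project" / "ns" / "inner").mkdir(parents=True)
        for pkg in ("my_project", "my_project/bar", "my_project/ns/inner"):
            (root / pkg / "__init__.py").touch()
        # my_project/ns has no __init__.py, so ns and ns/inner are skipped.
        print(regular_packages(root))
```

If a component module is not being discovered, a missing __init__.py somewhere along its path is a likely cause to check first.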

The torchx CLI displays registered components via:

$ torchx builtins
Found 1 builtin components:
1. foo.baz.trainer

The custom component can then be used as:

$ torchx run foo.baz.trainer -- --name "test app"

When you register your own components, TorchX does not include its own builtins. To add TorchX's builtin components back, you must specify another entry:

# setup.py
...
entry_points={
    "torchx.components": [
        "foo = my_project.bar",
        "torchx = torchx.components",
    ],
}

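This adds the TorchX builtins back, but with a torchx.* component name prefix (e.g. torchx.dist.ddp versus the default dist.ddp).

If two registry entries point to the same component, for instance: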

# setup.py
...
entry_points={
    "torchx.components": [
        "foo = my_project.bar",
        "test = my_project",
    ],
}

the components in my_project.bar are registered twice, as two overlapping sets under different prefix aliases: foo.* and test.bar.*. Concretely:

$ torchx builtins
Found 2 builtin components:
1. foo.baz.trainer
2. test.bar.baz.trainer

To omit the grouping and make component names shorter, use an alias that starts with an underscore (e.g. _ or _0, _1, etc.). For example:

# setup.py
...
entry_points={
    "torchx.components": [
        "_0 = my_project.bar",
        "_1 = torchx.components",
    ],
}

This exposes the trainer component as baz.trainer (as opposed to foo.baz.trainer) and adds the builtin components back as in a vanilla installation of torchx, without the torchx.* prefix.

$ torchx builtins
Found 11 builtin components:
1. baz.trainer
2. dist.ddp
3. utils.python
4. ... <more builtins from torchx.components.* ...>
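The naming rules in this section can be summarized in a few lines. The sketch below is a model of the behavior described above, not TorchX's implementation: registered_name is a hypothetical helper, and treating any alias that starts with an underscore as "no prefix" is an assumption based on the _, _0, _1 examples.

```python
def registered_name(alias: str, registered_module: str, component: str) -> str:
    """Model the component name reported for one entry point.

    alias: left-hand side of the entry, e.g. "foo" or "_0"
    registered_module: right-hand side, e.g. "my_project.bar"
    component: fully qualified function, e.g. "my_project.bar.baz.trainer"
    """
    prefix = registered_module + "."
    if not component.startswith(prefix):
        raise ValueError(f"{component} is not inside {registered_module}")
    relative = component[len(prefix):]
    # Aliases starting with an underscore drop the grouping prefix.
    return relative if alias.startswith("_") else f"{alias}.{relative}"


if __name__ == "__main__":
    trainer = "my_project.bar.baz.trainer"
    print(registered_name("foo", "my_project.bar", trainer))
    print(registered_name("test", "my_project", trainer))
    print(registered_name("_0", "my_project.bar", trainer))
    print(registered_name("_1", "torchx.components", "torchx.components.dist.ddp"))
```

The four calls correspond to the foo.baz.trainer, test.bar.baz.trainer, baz.trainer, and dist.ddp names shown in the torchx builtins listings above.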
