Advanced Usage¶

TorchX defines plugin points that let you configure it to best support your infrastructure setup. Most of the configuration is done through Python’s entry points.

Note

Entry points require that the Python package containing them be installed. If you don’t have a Python package, we recommend you create one so you can share your resource definitions, schedulers, and components across your team and organization.

The entry points described below can be specified in your project’s setup.py file as follows:

from setuptools import setup

setup(
    name="project foobar",
    entry_points={
        "torchx.schedulers": [
            "my_scheduler = my.custom.scheduler:create_scheduler",
        ],
        "torchx.named_resources": [
            "gpu_x2 = my_module.resources:gpu_x2",
        ],
    }
)
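Once the package is installed, one way to confirm that its entry points are visible is to query them with the standard library's importlib.metadata. The following is a minimal sketch: list_plugins is a hypothetical helper written for illustration, and the way TorchX itself loads these groups may differ.

```python
from importlib.metadata import entry_points


def list_plugins(group: str) -> dict:
    """Return {entry point name: "module:attr"} for one entry point group.

    list_plugins is a hypothetical helper, not part of TorchX.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: the result supports select(group=...)
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: the result is a dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


# Shows whatever is registered in the current environment (may be empty).
print(list_plugins("torchx.schedulers"))
print(list_plugins("torchx.named_resources"))
```

If the package above is installed, my_scheduler and gpu_x2 should appear under their respective groups; an empty result usually means the package has not been installed into the active environment.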

Registering Custom Schedulers¶

You can implement a custom scheduler by implementing the torchx.schedulers.Scheduler interface.

The create_scheduler function should have the following function signature:

from torchx.schedulers import Scheduler

def create_scheduler(session_name: str, **kwargs: object) -> Scheduler:
    return MyScheduler(session_name, **kwargs)

You can then register this custom scheduler by adding an entry_points definition to your Python project.

# setup.py
...
entry_points={
    "torchx.schedulers": [
        "my_scheduler = my.custom.scheduler:create_scheduler",
    ],
}

Registering Named Resources¶

A Named Resource is a predefined resource spec that is given a string name. This is particularly useful when your cluster has a fixed set of instance types. For instance, if your deep learning training Kubernetes cluster on AWS consists only of p3.16xlarge instances (64 vCPU, 8 GPU, 488 GB), you may want to enumerate t-shirt-sized resource specs for the containers as:

from torchx.specs import Resource

def gpu_x1() -> Resource:
    return Resource(cpu=8,  gpu=1, memMB=61_000)

def gpu_x2() -> Resource:
    return Resource(cpu=16, gpu=2, memMB=122_000)

def gpu_x3() -> Resource:
    return Resource(cpu=32, gpu=4, memMB=244_000)

def gpu_x4() -> Resource:
    return Resource(cpu=64, gpu=8, memMB=488_000)
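The numbers in these specs are proportional slices of the host: each GPU comes with one eighth of the instance's vCPUs and memory. The arithmetic can be sketched as follows. ResourceSpec and share are hypothetical stand-ins used only for illustration (they are not part of TorchX), chosen so that the snippet runs without TorchX installed.

```python
from typing import NamedTuple


class ResourceSpec(NamedTuple):
    """Hypothetical stand-in for torchx.specs.Resource, for illustration only."""
    cpu: int
    gpu: int
    memMB: int


# Host capacity of a p3.16xlarge as quoted above: 64 vCPU, 8 GPU, 488 GB.
HOST = ResourceSpec(cpu=64, gpu=8, memMB=488_000)


def share(gpus: int, host: ResourceSpec = HOST) -> ResourceSpec:
    """Return the slice of the host proportional to the requested GPU count."""
    if not 1 <= gpus <= host.gpu:
        raise ValueError(f"gpus must be between 1 and {host.gpu}, got {gpus}")
    return ResourceSpec(
        cpu=host.cpu * gpus // host.gpu,
        gpu=gpus,
        memMB=host.memMB * gpus // host.gpu,
    )


# Reproduces the four specs defined above (1, 2, 4, and 8 GPUs).
for n in (1, 2, 4, 8):
    print(share(n))
```

Depending on how the cluster is set up, some host capacity may be reserved for system processes, in which case the real specs may need to be somewhat smaller than these exact fractions.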

To make these resource definitions available you then need to register them via entry_points:

# setup.py
...
entry_points={
    "torchx.named_resources": [
        "gpu_x2 = my_module.resources:gpu_x2",
    ],
}

Once you install the package with the entry_points definitions, the named resource can then be used in the following manner:

>>> from torchx.specs import get_named_resources
>>> get_named_resources("gpu_x2")
Resource(cpu=16, gpu=2, memMB=122000, ...)

# my_module.component
from torchx.specs import AppDef, Role, get_named_resources

def test_app(resource: str) -> AppDef:
    return AppDef(name="test_app", roles=[
        Role(
            name="...",
            image="...",
            resource=get_named_resources(resource),
        )
    ])

test_app("gpu_x2")

Registering Custom Components¶

You can author a custom set of components and register them with the torchx CLI as builtins. This lets you curate the components most relevant to your team or organization and support them as CLI builtins, so that users see your custom components when they run

$ torchx builtins

Custom components can be registered via the following modification of the entry_points:

# setup.py
...
entry_points={
    "torchx.components": [
        "foo = my_project.bar",
    ],
}

The line above registers a group foo that is associated with the module my_project.bar. TorchX recursively traverses the directory associated with my_project.bar and finds all components defined in it.

Note

If there are two registry entries, e.g. foo = my_project.bar and test = my_project, there will be two overlapping sets of components with different aliases.

After registration, the torchx CLI displays the registered components via:

$ torchx builtins

If my_project.bar has the following directory structure:

$PROJECT_ROOT/my_project/bar/
    |- baz.py

and baz.py defines a component (a function) called trainer, then the component can be run as a job as follows:

$ torchx run foo.baz.trainer -- --name "test app"
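The relationship between a component name on the command line and the Python function behind it can be sketched as follows. This only illustrates the naming scheme described above; resolve_component is a hypothetical helper, not TorchX's actual resolver, and it handles only the nested form shown here (group, then submodules, then function).

```python
def resolve_component(name: str, groups: dict) -> tuple:
    """Map '<group>.<submodules>.<function>' to (module_path, function_name).

    groups maps a registered alias to its module, mirroring the
    "torchx.components" entry points. Hypothetical helper for illustration.
    """
    group, _, rest = name.partition(".")
    if group not in groups or "." not in rest:
        raise ValueError(f"cannot resolve component name: {name!r}")
    submodule, _, func = rest.rpartition(".")
    return f"{groups[group]}.{submodule}", func


# The two registry entries used as examples in this section.
groups = {"foo": "my_project.bar", "test": "my_project"}

print(resolve_component("foo.baz.trainer", groups))
print(resolve_component("test.bar.baz.trainer", groups))
```

Under this scheme both names point at the trainer function in my_project/bar/baz.py, which is the kind of overlap the note above describes when two registry entries cover the same modules.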
