Configuring¶
TorchX defines plugin points that let you configure TorchX to best support your infrastructure setup. Most of the configuration is done through Python entry points (https://packaging.python.org/specifications/entry-points/).
The entry points described below can be specified in your project's setup.py file as:

from setuptools import setup

setup(
    name="project foobar",
    entry_points={
        "$ep_group_name": [
            "$ep_name = $module:$function",
        ],
    },
)
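As a rough illustration of how a `$module:$function` reference resolves, the standard library's importlib.metadata can load one by hand (installed packages get these from their setup.py; the name, value, and group below are made-up placeholders):

```python
from importlib.metadata import EntryPoint

# Construct an entry point manually to show how the "$module:$function"
# value on the right-hand side resolves to a Python object.
ep = EntryPoint(name="sqrt_fn", value="math:sqrt", group="demo.group")

fn = ep.load()  # imports the "math" module and returns its "sqrt" attribute
print(fn(9.0))  # 3.0
```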
Registering Custom Schedulers¶
You may implement a custom scheduler by implementing the torchx.schedulers.Scheduler interface. Once you do, you can register the custom scheduler with the following entry point:
[torchx.schedulers]
$sched_name = my.custom.scheduler:create_scheduler
The create_scheduler function should have the following signature:
from typing import Any

from torchx.schedulers import Scheduler

def create_scheduler(session_name: str, **kwargs: Any) -> Scheduler:
    return MyScheduler(session_name, **kwargs)
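A minimal sketch of the factory pattern, using a stand-in base class in place of torchx.schedulers.Scheduler so it runs standalone; a real implementation would subclass the actual interface and implement its abstract methods (the MyScheduler class and its backend name are hypothetical):

```python
from typing import Any


class Scheduler:  # stand-in for torchx.schedulers.Scheduler
    def __init__(self, backend: str, session_name: str) -> None:
        self.backend = backend
        self.session_name = session_name


class MyScheduler(Scheduler):
    """Hypothetical scheduler submitting jobs to a custom backend."""

    def __init__(self, session_name: str, **kwargs: Any) -> None:
        super().__init__(backend="my_backend", session_name=session_name)
        # scheduler-specific options passed through from the entry point
        self.extra_cfg = kwargs


def create_scheduler(session_name: str, **kwargs: Any) -> Scheduler:
    # Entry-point target: TorchX calls this to construct the scheduler.
    return MyScheduler(session_name, **kwargs)


scheduler = create_scheduler("my_session", queue="gpu")
```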
Registering Named Resources¶
A Named Resource is a set of predefined resource specs that are given a string name. This is particularly useful when your cluster has a fixed set of instance types. For instance, if your deep learning training Kubernetes cluster on AWS consists only of p3.16xlarge instances (64 vCPUs, 8 GPUs, 488GB memory), you may want to enumerate t-shirt-sized resource specs for the containers as:
{
    "gpu_x_1": Resource(cpu=8, gpu=1, memMB=61 * GB),
    "gpu_x_2": Resource(cpu=16, gpu=2, memMB=122 * GB),
    "gpu_x_4": Resource(cpu=32, gpu=4, memMB=244 * GB),
    "gpu_x_8": Resource(cpu=64, gpu=8, memMB=488 * GB),
}
and refer to the resources by their string names.
TorchX supports defining custom resources and referencing them by their string names:
[torchx.named_resources]
gpu_x_2 = my_module.resources:gpu_x_2
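A sketch of what the my_module/resources.py factory referenced above might contain; a stand-in dataclass is used here in place of torchx.specs.Resource so the example runs standalone (a real factory would return torchx.specs.Resource):

```python
from dataclasses import dataclass

GB = 1024  # memMB is in megabytes; 1 GB = 1024 MB


@dataclass
class Resource:  # stand-in for torchx.specs.Resource
    cpu: int
    gpu: int
    memMB: int


def gpu_x_2() -> Resource:
    # Factory referenced by the "torchx.named_resources" entry point;
    # the spec matches the "gpu_x_2" t-shirt size from the table above.
    return Resource(cpu=16, gpu=2, memMB=122 * GB)
```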
The named resource can then be used in the following manner:
# my_module.component
from torchx.specs import AppDef, Role, named_resources

app = AppDef(name="test_app", roles=[Role(..., resource=named_resources("gpu_x_2"))])
or using torch.distributed.run:
# my_module.component
from torchx.specs import AppDef
from torchx.components.base import torch_dist_role

app = AppDef(name="test_app", roles=[torch_dist_role(..., resource="gpu_x_2", ...)])
Registering Custom Components¶
It is possible to author and register a custom set of components as builtins to the torchx CLI. This makes it possible to customize a set of components most relevant to your team or organization and support it as a CLI builtin. This way users will see your custom components when they run:
$ torchx builtins
Custom components can be registered via the following modification of entry_points.txt:
[torchx.components]
foo = my_project.bar
The line above registers a group foo that is associated with the module my_project.bar. TorchX will recursively traverse the lowest-level directory associated with my_project.bar and find all defined components.
Note
If there are two registry entries, e.g. foo = my_project.bar and test = my_project, there will be two sets of overlapping components with different aliases.
After registration, the torchx CLI will display the registered components via:
$ torchx builtins
If my_project.bar has the following directory structure:
$PROJECT_ROOT/my_project/bar/
|- baz.py
and baz.py defines a component (a function) called trainer, then the component can be run as a job in the following manner:
$ torchx run foo.baz.trainer -- --name "test app"
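A sketch of what the trainer component in baz.py might look like; AppDef and Role here are stand-ins for the torchx.specs classes so the example runs standalone, and the image and entrypoint values are made-up placeholders:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Role:  # stand-in for torchx.specs.Role
    name: str
    image: str
    entrypoint: str
    args: List[str] = field(default_factory=list)


@dataclass
class AppDef:  # stand-in for torchx.specs.AppDef
    name: str
    roles: List[Role]


def trainer(name: str, image: str = "my_registry/trainer:latest") -> AppDef:
    # Component function: typed arguments surface as CLI flags
    # (e.g. --name) when invoked via `torchx run foo.baz.trainer`.
    return AppDef(
        name=name,
        roles=[
            Role(
                name="trainer",
                image=image,
                entrypoint="python",
                args=["-m", "train"],
            )
        ],
    )


app = trainer(name="test app")
```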