Base¶
These components are meant to be used as convenience methods when constructing other components. Many methods in the base component library are factory methods for Role, Container, and Resources that are hooked up to TorchX’s configurable extension points.
- torchx.components.base.binary_component.binary_component(name: str, image: str, entrypoint: str, args: Optional[List[str]] = None, env: Optional[Dict[str, str]] = None, resource: torchx.specs.api.Resource = Resource(cpu=-1, gpu=-1, memMB=-1, capabilities={})) → torchx.specs.api.AppDef[source]¶
binary_component creates a single binary component from the provided arguments.
>>> from torchx.components.base.binary_component import binary_component
>>> binary_component(
...     name="datapreproc",
...     image="pytorch/pytorch:latest",
...     entrypoint="python3",
...     args=["--version"],
...     env={
...         "FOO": "bar",
...     },
... )
AppDef(name='datapreproc', ...)
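To make the shape of the return value concrete, here is a minimal, self-contained sketch of what binary_component assembles: a single Role wrapped in an AppDef. The Role, AppDef, and binary_component_sketch names below are simplified stand-ins for illustration, not the real torchx.specs.api classes, which carry additional fields such as resource and num_replicas.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Simplified stand-ins for torchx.specs.api.Role and AppDef
# (illustrative only; the real classes have more fields).
@dataclass
class Role:
    name: str
    image: str
    entrypoint: str
    args: List[str] = field(default_factory=list)
    env: Dict[str, str] = field(default_factory=dict)


@dataclass
class AppDef:
    name: str
    roles: List[Role] = field(default_factory=list)


def binary_component_sketch(name, image, entrypoint, args=None, env=None):
    # binary_component effectively builds a one-role AppDef from its arguments.
    role = Role(name=name, image=image, entrypoint=entrypoint,
                args=args or [], env=env or {})
    return AppDef(name=name, roles=[role])


app = binary_component_sketch(
    name="datapreproc",
    image="pytorch/pytorch:latest",
    entrypoint="python3",
    args=["--version"],
    env={"FOO": "bar"},
)
```

The key point is that the component is pure construction: it performs no validation or scheduling itself, it only packages the arguments into an AppDef for a scheduler to consume.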
- torchx.components.base.roles.create_torch_dist_role(name: str, image: str, entrypoint: str, resource: torchx.specs.api.Resource = Resource(cpu=-1, gpu=-1, memMB=-1, capabilities={}), base_image: Optional[str] = None, args: Optional[List[str]] = None, env: Optional[Dict[str, str]] = None, num_replicas: int = 1, max_retries: int = 0, port_map: Dict[str, int] = <factory: dict>, retry_policy: torchx.specs.api.RetryPolicy = <RetryPolicy.APPLICATION: 'APPLICATION'>, **launch_kwargs: Any) → torchx.specs.api.Role[source]¶
A Role for which the user-provided entrypoint is executed with the torchelastic agent (in the container). Note that the torchelastic agent invokes multiple copies of entrypoint.

For more information about torchelastic, see the torchelastic quickstart docs.
Important
It is the responsibility of the user to ensure that the role’s image includes torchelastic. Since TorchX has no control over the build process of the image, it cannot automatically include torchelastic in the role’s image.
The following example launches 2 replicas (nodes) of an elastic my_train_script.py that is allowed to scale between 2 to 4 nodes. Each node runs 8 workers, which are allowed to fail and restart a maximum of 3 times.

Warning

num_replicas MUST BE an integer between (inclusive) nnodes. That is, Role("trainer", nnodes="2:4", num_replicas=5) is invalid and will result in undefined behavior.

>>> from torchx.components.base.roles import create_torch_dist_role
>>> from torchx.specs.api import NULL_RESOURCE
>>> elastic_trainer = create_torch_dist_role(
...     name="trainer",
...     image="<NONE>",
...     resource=NULL_RESOURCE,
...     entrypoint="my_train_script.py",
...     args=["--script_arg", "foo", "--another_arg", "bar"],
...     num_replicas=4, max_retries=1,
...     nproc_per_node=8, nnodes="2:4", max_restarts=3)
... # effectively runs:
... #    python -m torch.distributed.launch
... #        --nproc_per_node 8
... #        --nnodes 2:4
... #        --max_restarts 3
... #        my_train_script.py --script_arg foo --another_arg bar
>>> elastic_trainer
Role(name='trainer', ...)
- Parameters
name – Name of the role
entrypoint – User binary or python script that will be launched.
resource – Resource that is requested by the scheduler
base_image – Optional base image, if schedulers support image overlay
args – User provided arguments
env – Environment variables that will be set on the worker process that runs the entrypoint
num_replicas – Number of role replicas to run
max_retries – Max number of retries
port_map – Port mapping for the role
retry_policy – Retry policy that is applied to the role
launch_kwargs – kwarg-style launch arguments that will be used to launch the torchelastic agent
- Returns
Role object that launches the user entrypoint via torchelastic as a proxy
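The "effectively runs" comment in the example above can be expressed as a small helper that assembles the launcher command line from the kwarg-style launch arguments. This is an illustrative sketch only: torch_dist_launch_cmd is a hypothetical name, not part of the TorchX API, and the exact agent flags may vary across torchelastic versions.

```python
from typing import List


def torch_dist_launch_cmd(entrypoint: str, script_args: List[str],
                          nproc_per_node: int = 8, nnodes: str = "2:4",
                          max_restarts: int = 3) -> List[str]:
    # Hypothetical helper mirroring the "effectively runs" comment above:
    # the role proxies the user entrypoint through torch.distributed.launch,
    # forwarding the kwarg-style launch arguments as agent flags.
    return [
        "python", "-m", "torch.distributed.launch",
        "--nproc_per_node", str(nproc_per_node),
        "--nnodes", str(nnodes),
        "--max_restarts", str(max_restarts),
        entrypoint,
    ] + script_args


cmd = torch_dist_launch_cmd(
    "my_train_script.py",
    ["--script_arg", "foo", "--another_arg", "bar"],
)
print(" ".join(cmd))
```

Note that the user's script arguments are appended after the entrypoint, so from the agent's perspective they are opaque and are passed through to each of the nproc_per_node worker copies unchanged.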