Base

These components are meant to be used as convenience methods when constructing other components. Many methods in the base component library are factory methods for Role, Container, and Resources that are hooked up to TorchX’s configurable extension points.

torchx.components.base.binary_component.binary_component(name: str, image: str, entrypoint: str, args: Optional[List[str]] = None, env: Optional[Dict[str, str]] = None, resource: torchx.specs.api.Resource = Resource(cpu=-1, gpu=-1, memMB=-1, capabilities={})) → torchx.specs.api.AppDef

binary_component creates a single binary component from the provided arguments.

>>> from torchx.components.base.binary_component import binary_component
>>> binary_component(
...     name="datapreproc",
...     image="pytorch/pytorch:latest",
...     entrypoint="python3",
...     args=["--version"],
...     env={
...         "FOO": "bar",
...     },
... )
AppDef(name='datapreproc', ...)

torchx.components.base.roles.create_torch_dist_role(name: str, image: str, entrypoint: str, resource: torchx.specs.api.Resource = Resource(cpu=-1, gpu=-1, memMB=-1, capabilities={}), base_image: Optional[str] = None, args: Optional[List[str]] = None, env: Optional[Dict[str, str]] = None, num_replicas: int = 1, max_retries: int = 0, port_map: Dict[str, int] = Field(default_factory=dict), retry_policy: torchx.specs.api.RetryPolicy = RetryPolicy.APPLICATION, **launch_kwargs: Any) → torchx.specs.api.Role

A Role whose user-provided entrypoint is executed by the torchelastic agent (in the container). Note that the torchelastic agent invokes multiple copies of entrypoint.

For more information about torchelastic, see the torchelastic quickstart docs.

Important

It is the user's responsibility to ensure that the role's image includes torchelastic. Since TorchX has no control over the build process of the image, it cannot automatically include torchelastic in the role's image.

The following example launches 4 replicas (nodes) of an elastic my_train_script.py that is allowed to scale between 2 and 4 nodes. Each node runs 8 workers, which are allowed to fail and restart a maximum of 3 times.

Warning

num_replicas MUST be an integer within the range given by nnodes (inclusive). For example, Role("trainer", nnodes="2:4", num_replicas=5) is invalid and results in undefined behavior.
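
The constraint in the warning above can be checked before a role is created. The following is a minimal sketch and not part of TorchX: parse_nnodes and validate_num_replicas are hypothetical helpers, and the sketch assumes nnodes is either a single integer such as "4" or a "MIN:MAX" range as in the example.

```python
from typing import Tuple


def parse_nnodes(nnodes: str) -> Tuple[int, int]:
    """Parse "MIN:MAX" (or a single "N") into an inclusive (min, max) pair.

    Hypothetical helper for illustration; not a TorchX API.
    """
    parts = nnodes.split(":")
    if len(parts) == 1:
        lo = hi = int(parts[0])
    elif len(parts) == 2:
        lo, hi = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"invalid nnodes value: {nnodes!r}")
    if lo < 1 or lo > hi:
        raise ValueError(f"invalid nnodes range: {nnodes!r}")
    return lo, hi


def validate_num_replicas(num_replicas: int, nnodes: str) -> None:
    """Raise ValueError if num_replicas is outside the nnodes range."""
    lo, hi = parse_nnodes(nnodes)
    if not lo <= num_replicas <= hi:
        raise ValueError(
            f"num_replicas={num_replicas} must be between {lo} and {hi} (inclusive)"
        )


validate_num_replicas(4, "2:4")  # within the range, so no error is raised
```

Calling validate_num_replicas(5, "2:4") raises ValueError in this sketch, which surfaces the invalid combination early instead of leaving it to undefined behavior at launch time.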

>>> from torchx.components.base.roles import create_torch_dist_role
>>> from torchx.specs.api import NULL_RESOURCE
>>> elastic_trainer = create_torch_dist_role(
...     name="trainer",
...     image="<NONE>",
...     resource=NULL_RESOURCE,
...     entrypoint="my_train_script.py",
...     args=["--script_arg", "foo", "--another_arg", "bar"],
...     num_replicas=4, max_retries=1,
...     nproc_per_node=8, nnodes="2:4", max_restarts=3)
... # effectively runs:
... #    python -m torch.distributed.launch
... #        --nproc_per_node 8
... #        --nnodes 2:4
... #        --max_restarts 3
... #        my_train_script.py --script_arg foo --another_arg bar
>>> elastic_trainer
Role(name='trainer', ...)

Parameters

  • name – Name of the role

  • entrypoint – User binary or python script that will be launched.

  • resource – Resource requested from the scheduler

  • base_image – Optional base image, if the scheduler supports image overlays

  • args – User-provided arguments

  • env – Environment variables set on the worker process that runs entrypoint

  • num_replicas – Number of role replicas to run

  • max_retries – Max number of retries

  • port_map – Port mapping for the role

  • retry_policy – Retry policy that is applied to the role

  • launch_kwargs – Keyword-style launch arguments used to launch the torchelastic agent.

Returns

A Role object that launches the user entrypoint via the torchelastic agent as a proxy.
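
To illustrate how launch_kwargs relate to the command shown in the "effectively runs" comment of the example, here is a minimal sketch. It is an assumption about the mapping (each keyword becomes a --key value flag placed before the entrypoint), not TorchX's actual implementation, and build_launch_cmd is a hypothetical helper.

```python
from typing import Any, List


def build_launch_cmd(entrypoint: str, args: List[str], **launch_kwargs: Any) -> List[str]:
    """Assemble a launcher command line from keyword-style launch arguments.

    Hypothetical helper for illustration; not a TorchX API.
    """
    cmd = ["python", "-m", "torch.distributed.launch"]
    # Keyword arguments keep their call order, so flags appear as written.
    for key, value in launch_kwargs.items():
        cmd += [f"--{key}", str(value)]
    # The user entrypoint and its own arguments come after the launcher flags.
    cmd.append(entrypoint)
    cmd += args
    return cmd


cmd = build_launch_cmd(
    "my_train_script.py",
    ["--script_arg", "foo", "--another_arg", "bar"],
    nproc_per_node=8,
    nnodes="2:4",
    max_restarts=3,
)
print(" ".join(cmd))
```

With the arguments from the example, this prints the same command line as the comment: the launcher flags first, then my_train_script.py followed by the script's own arguments.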
